
Internet Search

(COSC 488) Nazli Goharian

nazli@cs.georgetown.edu

Nazli Goharian, 2005, 2012

Outline

  • Web: Indexing & Efficiency
    – Partitioned Indexing
    – Index Tiering & other early termination techniques
    – Index in Dynamic Environment
  • Improving effectiveness of Web search engines
    – Web page ranking
      • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
    – Result snippets
  • Social Search
    – Tagging, collaborative search/filtering, recommender systems
    – Real-time search
  • Peer-to-Peer Search


The Web

  • Document collections are scattered across many geographical areas.
  • Constraints prohibiting the centralization of data include:
    – Data security
    – Volume
    – Rate of change
    – Political and legal constraints
    – Other proprietary motivations

Web Search

  • Parallel and distributed processing
  • Web search tools access data distributed on servers worldwide but indexed centrally.
  • Most of these systems have a partitioned index on large clusters of servers with centralized control.
  • They store pointers in the form of hypertext links to various Web servers.


Partitioned Indexing

  • Partitioning of the index across multiple machines, based on either:
  • Terms (Global index organization)
    – Each node holds the posting lists for some terms
    – Using a content index, query terms are sent to the nodes holding those terms
    – Higher concurrency level, but longer posting lists
  • Documents (Local index organization) – more common
    – Each node holds a complete index for its documents (shorter posting lists)
    – Query terms are sent to all nodes
    – Top-k results from each node are merged
    – Global statistics (e.g., idf) must be calculated
  • A hybrid approach in Tiered Indexing may be used
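The document-partitioned (local) organization can be sketched as follows: each node scores the query against only its own shard and returns its top k, and a broker merges the partial lists. This is a minimal sketch; the shard contents and the toy scoring function are illustrative assumptions, not a real ranking model.

```python
import heapq

# Hypothetical per-node shards: each node indexes its own documents.
shards = [
    {"d1": "web search engine", "d2": "index tiering"},
    {"d3": "search engine ranking", "d4": "page rank"},
]

def score(text, query_terms):
    # Toy query-dependent score: count of matching query terms.
    return sum(text.split().count(t) for t in query_terms)

def node_top_k(shard, query_terms, k):
    # Each node computes its own local top-k.
    scored = [(score(text, query_terms), doc) for doc, text in shard.items()]
    return heapq.nlargest(k, scored)

def broker_search(query, k=2):
    terms = query.split()
    # Broadcast the query to every node, then merge the partial top-k lists.
    partials = [node_top_k(s, terms, k) for s in shards]
    return heapq.nlargest(k, [hit for p in partials for hit in p])

print(broker_search("search engine"))
```

Note that a real system would also have to share global statistics (such as idf) across nodes before scoring, as the slide points out.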

Index Tiering

  • A popular early termination technique to improve the efficiency of query processing
  • Divides nodes into two tiers, allocating the index of the most popular documents to tier 1 and the rest to tier 2
  • Search tier 1 first; if there are not enough results, then search tier 2
  • Note: other popular early termination techniques (top-doc and query pruning) were discussed earlier in the semester!
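The tiered evaluation above can be sketched as: query tier 1, and fall through to tier 2 only when tier 1 cannot fill the result list. A minimal sketch; the tier contents and the exact-match rule are illustrative assumptions.

```python
# Tier 1 holds the most popular documents; tier 2 holds the rest.
tier1 = {"d1": "web search", "d2": "search engine index"}
tier2 = {"d3": "search tiering", "d4": "query pruning"}

def match(shard, term):
    # Toy matching: return documents containing the term.
    return [doc for doc, text in shard.items() if term in text.split()]

def tiered_search(term, k):
    results = match(tier1, term)
    if len(results) < k:          # not enough results: search tier 2 as well
        results += match(tier2, term)
    return results[:k]

print(tiered_search("search", 2))  # tier 1 alone satisfies k
print(tiered_search("query", 1))   # falls through to tier 2
```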



Distributed Index Construction

  • Not possible on a single machine
  • Various architectures exist for distributed indexing
  • MapReduce architecture (a term-partitioned index)
  • Master node assigns tasks to worker nodes (map workers & reduce workers) to split up the computing jobs:
    – Map phase: parsing & building localized <term, doc> pairs
    – Reduce phase: combining/merging posting pairs for each term


MapReduce (Cont’d)

  • Map & reduce phases can be done in parallel on many machines
  • A map machine can serve as a reduce machine in the process
  • Data is broken into pieces (shards), generally 16–64 MB [128 MB], and sent to map workers as they finish their jobs
  • Map workers work on one shard at a time (generally, unless they have more than one CPU); they parse and generate <term, doc> pairs (which can be combined into <term, doc, tf>)
  • Pairs are sorted on term, and then on a secondary key (doc_id)
  • The same keys (terms) are assigned to the same reduce worker
  • Load should be balanced across the reducers
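The map and reduce phases above can be sketched in a single process: map parses each document into (term, doc) pairs; reduce sorts the pairs on term (with doc_id as the secondary key) and merges them into posting lists. A minimal sketch; the three documents are an illustrative assumption, and real MapReduce distributes shards and key ranges across machines.

```python
from collections import defaultdict

docs = {1: "web search engine", 2: "search index", 3: "web index"}

def map_phase(doc_id, text):
    # Parse a document into (term, doc_id) pairs.
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    # Merge the pairs for each term into a sorted posting list.
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):   # sort key: term, then doc_id
        postings[term].append(doc_id)
    return dict(postings)

pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["index"])   # posting list for the term "index"
```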


MapReduce (Cont’d)

Taken from: C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008

Query Servers

  • Each server has its own disk holding a portion of the index
  • Queries are distributed, via a centralized control, to the servers that contain the related posting lists
  • Common terms may map to many servers
  • No single point of resource contention (efficient)
  • If a server crashes, that portion of the index is not available


Index in Dynamic Environment

  • The data collection is not static
  • Reconstruct the index periodically from scratch (many search engines use this)
  • Maintain an auxiliary index to store new documents & re-merge with the existing index
  • Maintain multiple indexes – complicated in maintaining collection statistics
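The auxiliary-index strategy can be sketched as: new documents go into a small auxiliary index, search consults both indexes, and a re-merge periodically folds the auxiliary postings into the main index. A minimal sketch under illustrative assumptions (in-memory dicts stand in for on-disk index structures).

```python
main_index = {"web": [1, 2], "search": [2]}
aux_index = {}

def add_document(doc_id, text):
    # New documents only touch the small auxiliary index.
    for term in set(text.split()):
        aux_index.setdefault(term, []).append(doc_id)

def search(term):
    # Until the next re-merge, results come from both indexes.
    return main_index.get(term, []) + aux_index.get(term, [])

def remerge():
    # Fold auxiliary postings into the main index, then clear the auxiliary.
    for term, postings in aux_index.items():
        main_index[term] = sorted(set(main_index.get(term, []) + postings))
    aux_index.clear()

add_document(3, "web search")
print(search("web"))   # main + auxiliary postings
remerge()
print(search("web"))   # now entirely from the main index
```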


Outline

  • Web: Indexing & Efficiency
    – Partitioned Indexing
    – Index Tiering & other early termination techniques
    – Index in Dynamic Environment
  • Improving effectiveness of Web search engines
    – Web page ranking
      • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
    – Result snippets
  • Social Search
    – Tagging, collaborative search/filtering, recommender systems
    – Real-time search
  • Peer-to-Peer Search


Definitions

  • Web graph: each page is a node and links are directed edges from one node to another
  • Out-links (out-degree) of A: links from page A to other pages
  • In-links (in-degree) of A: links from other pages to A
  • Sink: a page with no out-links (out-degree = 0)
  • Source: a page with no in-links (in-degree = 0)
  • Static page: a page generated prior to any request
  • Dynamic page: a page generated as the result of a request
  • Hidden/deep Web: pages with no links, password protected, reachable only via a form, …
  • Indexable Web: union of the pages indexed by major search engines

Evaluation of Web Search Engines: High Precision Search

  • Traditional IR systems are evaluated based on precision and recall.
  • Web search engines are evaluated based on the top N documents.
  • Recall estimation is very difficult.
  • Precision is of limited concern, as many users do not look beyond the first screen. => How fast and accurately is the first results screen generated?


Web Page Ranking

  • Considering both query-dependent and query-independent scores (captured during indexing), a global score is generated for each page:
  • Query-dependent score
    – Similarity measures such as Cosine, BM25, proximity, …
  • Query-independent score
    – Link analysis (anchor text, popularity metrics such as authorities and hubs, page rank, …)
    – Sponsored search
    – Localized search
    – Query log analysis
    – etc.

Query Log Analysis

  • Using user query patterns on certain days and at certain times of the day, week, month, and year, many optimizations are possible:
    – Pre-cache likely Web pages in anticipation of user queries to reduce page access delays, increasing system throughput (efficiency optimization)
    – Adjust relevance ranking to tune for certain user queries (accuracy optimization)


Anchor Text

  • Short, 2-3 terms, describing the linked/destination page
  • May or may not reflect a different point of view than the page author’s
  • Anchor text of links to a document di is included in the index for di
  • Extended anchor text (the text surrounding the anchor text) may also be used
  • Generally weighted based on frequency (notion of idf)
  • Spamming problem

Page Rank

  • A scoring mechanism in Web search (trademarked by Google and patented by Stanford)
  • Generally calculated at the time of crawling
  • Uses incoming and outgoing links as an indicator of popularity to adjust a Web page’s score
  • A popular page is defined as a page such that:
    – Many Web pages link to it (in-links)
    – Important (popular) pages link to it
  • May be affected by link spam

Page Rank

  PageRank(A) = (1 − d)/N + d · ( PageRank(D1)/C(D1) + … + PageRank(Dn)/C(Dn) )

  C(Di): number of links out from page Di
  d: damping factor (from 0 to 1; commonly 0.85)
  N: total number of pages

An iterative algorithm:

  Initially all pages are assigned an arbitrary page rank (1/N), summing to 1.
  Iteratively recalculate the scores until the new scores do not change significantly.
  To converge faster, page ranks may be initialized based on the number of in-links, log info, …
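The iterative algorithm above can be sketched in Python. The three-page graph is an illustrative assumption; a fixed iteration count stands in for a proper convergence test.

```python
# Tiny hypothetical Web graph: adjacency lists of out-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(graph, d=0.85, iterations=50):
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}        # arbitrary start: 1/N each
    for _ in range(iterations):
        # (1 - d)/N term, then add each page's shared-out rank.
        new_ranks = {page: (1 - d) / n for page in graph}
        for page, out_links in graph.items():
            share = d * ranks[page] / len(out_links)  # C(Di) = out-degree
            for target in out_links:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

ranks = pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

With these links, C collects rank from both A and B and ends up with the highest score, and the scores still sum to 1.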

Authorities and Hubs

  • Various algorithms are based on assigning each retrieved Web page two scores: an authority score and a hub score (HITS: Hyperlink-Induced Topic Search, 1999)
  • Authority page: an authoritative source on a given topic
  • Hub page: a page listing pointers to authority pages on a topic
  • Authority score: summation of the scores of all the hubs pointing to that authority page
  • Hub score: summation of the scores of all the authority pages the hub points to


Computing Authority and Hub Scores

  • Retrieve all pages containing the query term t. This is called the root set (~200 pages).
  • Create a set including the union of the root set pages, the pages that point to root set pages, and the pages that the root set pages point to. This is called the base set.
  • Use the base set to compute the hub and authority scores.
  • An iterative algorithm:
    – Initialize hubs and authorities with a score of 1
    – Update s(H) and s(A) until the scores converge
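The iterative HITS update can be sketched as follows: each authority score is the sum of the hub scores of the pages pointing to it, each hub score is the sum of the authority scores it points to, and both vectors are normalized every round so they do not grow without bound. A minimal sketch; the four-page base set is an illustrative assumption.

```python
import math

# Hypothetical base set: page -> pages it points to.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}

def hits(links, iterations=20):
    hubs = {p: 1.0 for p in links}
    auths = {p: 1.0 for p in links}
    for _ in range(iterations):
        # Authority score: sum of the hub scores pointing to the page.
        for p in auths:
            auths[p] = sum(hubs[q] for q in links if p in links[q])
        # Hub score: sum of the authority scores the page points to.
        for p in hubs:
            hubs[p] = sum(auths[q] for q in links[p])
        # Normalize both score vectors.
        for scores in (hubs, auths):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hubs, auths

hubs, auths = hits(links)
print(max(auths, key=auths.get))  # a1: pointed to by both hubs
```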


Sponsored Search

  • Search system vendors sell advertisers keywords, so that whenever such words are issued in a query, the advertiser’s desired homepage link is returned.
  • Sponsored search results are biased towards advertisers with higher bids, the click frequency of ads, …
  • Significant revenue is generated for search engine vendors via such a search approach (e.g., per click: 50 cents to 15 dollars)


Sponsored Search

  • Search engines maintain an advertisement database (description of the advertisement, link to that page, bids, popularity, …)
  • Searching the advertisement database for a match to:
    – query terms
    – keywords extracted from the retrieved result page (pseudo-relevance feedback, page features, …)
  • Ranking advertisements based on bids (on keywords) and advertisement popularity (using clickthrough data logs)

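Ranking ads by bid and popularity can be sketched as scoring each matching ad by bid times observed clickthrough rate (a common expected-revenue heuristic, used here as an assumption; the ad database is illustrative).

```python
# Hypothetical advertisement database: keywords, bid, and clickthrough rate.
ads = [
    {"ad": "shoes-sale", "keywords": {"shoes"}, "bid": 0.50, "ctr": 0.10},
    {"ad": "run-gear", "keywords": {"shoes", "running"}, "bid": 1.20, "ctr": 0.02},
    {"ad": "book-store", "keywords": {"books"}, "bid": 2.00, "ctr": 0.05},
]

def rank_ads(query_terms):
    # Match on sold keywords, then rank by expected revenue per impression.
    matches = [a for a in ads if a["keywords"] & set(query_terms)]
    return sorted(matches, key=lambda a: a["bid"] * a["ctr"], reverse=True)

for ad in rank_ads(["running", "shoes"]):
    print(ad["ad"], ad["bid"] * ad["ctr"])
```

Note that the lower bidder can still win the slot when its clickthrough rate is high enough, which matches the slide's point that ranking uses both bids and popularity.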

Localized Search

  • Using geographic information to modify the ranking of results (in addition to SC scores, link-based scores, …)
  • Geographic information may be derived from:
    – Location of the device sending the query
    – Context of the query
      • restaurant near Al Capone’s home town
      • restaurant near White Sox stadium
    – Geographic location in the query
      • Chicago restaurants
    – Geographic location in document metadata


Result Snippets

  • Providing users a short summary (snippet) of a page (title, URL, link to cached page, snippet)
  • Static snippets
    – Query independent
    – Created at indexing time and cached
    – Containing the title and n sentences/words (NLP can be used)
  • Dynamic snippets
    – Query dependent
    – Created at the time of results scoring
    – “Windows” of the document – also called KWIC (keyword in context)

Result Snippets

  • Index maintains sentence-level information
  • Snippet sentences can be picked:
    – Based on query term(s):
      • Heading
      • Location in document (nth sentence)
      • Closeness of query terms in the sentence
      • Ratio of query terms in the sentence
      • Unique query terms in the sentence
    – From page metadata
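Dynamic snippet selection along the lines above can be sketched as: split the page into sentences, then pick the sentence that scores best on unique query terms, ratio of query terms, and closeness of the terms. A minimal sketch; the document text and the score weights are illustrative assumptions.

```python
import re

def pick_snippet(text, query_terms):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    terms = {t.lower() for t in query_terms}

    def score(sentence):
        words = sentence.lower().split()
        positions = [i for i, w in enumerate(words) if w.strip(".,!?") in terms]
        if not positions:
            return 0.0
        unique = len({words[i].strip(".,!?") for i in positions})
        span = positions[-1] - positions[0] + 1          # closeness of terms
        return unique + len(positions) / len(words) + 1.0 / span

    return max(sentences, key=score)

doc = ("Web search is hard. Snippets summarize pages. "
       "Query terms guide snippet selection.")
print(pick_snippet(doc, ["snippet", "query"]))
```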


Result Snippets

  • An effective snippet should: (Clarke et al. 2007’s clickthrough analysis)
    – Have all the query terms (unless already included in the title)
    – Use the page metadata, if needed
    – Display the URL and mark the query terms
    – Provide meaningful snippets vs. only some keywords

Outline

  • Web: Indexing & Efficiency
    – Partitioned Indexing
    – Index Tiering & other early termination techniques
    – Index in Dynamic Environment
  • Improving effectiveness of Web search engines
    – Web page ranking
      • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
    – Result snippets
  • Social Search
    – Tagging, collaborative search/filtering, recommender systems
    – Real-time search
  • Peer-to-Peer Search


Social Search

  • Social search introduces new aspects to search engines
    – Village paradigm (collaborative) [Horowitz & Kamvar, WWW’10]
      • Crowd / social network / friends vs. corpus-based
      • Routing questions to potential answerers
  • Community of users, sharing a goal or interest, participate in search and interact with each other online
    – YouTube, Twitter, Flickr, Facebook, Myspace, LinkedIn, forums, blogs, online games, …

From Wikipedia: “Social search or a social search engine is a type of web search that takes into account the Social Graph of the person initiating the search query. When applied to web search this Social-Graph approach to relevance is in contrast to established algorithmic or machine-based approaches where relevance is determined by analyzing the text of each document or the link structure of the documents”

Real-Time Search

  • Traditional search indexes the crawled pages
  • Real-time search results of search engines such as Google, Bing, Yahoo come from a variety of real-time search services such as Twitter, Flickr, YouTube, etc.
    – Receive data directly from various social media and blogs (subscribed to social networking sites)
    – A filtering engine identifies spam
    – Measuring relevance – the ranking is based on:
      • time,
      • relevance to the query,
      • number of followers of the authors,
      • reputation of a link, defined by the frequency of forwarding (re-tweets), …
  • First real-time search: “Summize” in 2007, with real-time trend analysis (later merged with Twitter in 2008)


Social Search

  • Documents or websites are deemed relevant if the searcher’s social network was also interested in them
  • Nature of queries
    – In many cases opinionated, subjective
    – Query length (in many cases longer queries than the Web’s)
  • Index
    – Storing users’ behavior (responsiveness, answer quality, expertise)
    – Mapping users to topics

Social Search

  • Social search ranking – based on a combination of:
    – Query-dependent (probability of a good answer to query q by user u)
      • Similarity of results to the query (various rankings: cosine, BM25, proximity, …)
      • Relatedness of query/results to the user
    – Query-independent
      • How many users bookmarked x
      • Social trust
      • Similarity of asker to answerer – user profile similarity, users’ connectedness


Social Search

  • Mapping users to topics. An example: [Horowitz & Kamvar, WWW’10]
    – User specifies interest/expertise in topics
    – Friends of user u indicate the expertise of user u in topics
    – Automatically identified topics from:
      • User’s existing online profiles
      • User’s homepages, blogs
      • User’s status messages (Twitter, Facebook, IM, …)

Social Search

  • Measuring connectedness using cosine similarity over various features, such as: [Horowitz & Kamvar, WWW’10]
    – Social connection (common friends and affiliations)
    – Demographic similarity
    – Profile similarity (e.g., common favorite movies)
    – Vocabulary match (e.g., IM shortcuts)
    – Chattiness match (frequency of follow-up messages)
    – Verbosity match (the average length of messages)
    – Politeness match (e.g., use of “Thanks!”)
    – Speed match (responsiveness to other users)
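Cosine similarity over user feature vectors can be sketched as follows; the feature names and values are illustrative assumptions standing in for the features listed above.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse feature vectors (dicts).
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

alice = {"common_friends": 3.0, "chattiness": 0.8, "verbosity": 0.5}
bob   = {"common_friends": 2.0, "chattiness": 0.9, "verbosity": 0.4}
carol = {"common_friends": 0.0, "chattiness": 0.1, "verbosity": 2.0}

print(round(cosine(alice, bob), 3))    # near 1: similar feature profiles
print(round(cosine(alice, carol), 3))  # much lower: dissimilar profiles
```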


Social Search

Sample approach: [Karweg et al., CIKM’11]:

  • Social Relevance Score (SRS) ranks the result elements of a query according to their social relevance for the user. It is calculated based on 2 factors:
    – Engagement Intensity: how intensely the users interacted with the result
      • Engagement: interaction in terms of recommendations, ratings, status messages
      • Intensity: effort of textual feedback vs. a rating score / thumbs up
    – Trust Score: level of trust in those who recommend a link
      • Assigned by users & refined by social network analysis, using page rank on the social graph

  SRS(i) = Σ_{x ∈ E_i} e_i(x) · t(x)

  • SRS(i): social rank score of document/page i
  • x: a user in the social network who interacted with/recommended page i
  • E_i: the set of such users; e_i(x): engagement intensity of user x with page i; t(x): trust score of user x
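The SRS computation above can be sketched as a sum, over the users who interacted with a page, of engagement intensity times trust score. A minimal sketch; all user names, engagement values, and trust values are illustrative assumptions.

```python
# page -> {user: engagement intensity e_i(x)}
interactions = {
    "page1": {"alice": 2.0, "bob": 1.0},
    "page2": {"carol": 3.0},
}
# t(x): trust score per user, assigned by users / refined by network analysis.
trust = {"alice": 0.5, "bob": 0.5, "carol": 0.25}

def srs(page):
    # SRS(i) = sum over users x in E_i of e_i(x) * t(x)
    return sum(e * trust[user] for user, e in interactions[page].items())

print(srs("page1"))  # 2.0*0.5 + 1.0*0.5 = 1.5
print(srs("page2"))  # 3.0*0.25 = 0.75
```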

Social Search -- Trust

  • Trust has been discussed for years in sociology and social psychology
  • [Marsh, Ph.D. dissertation, 1994] formalized trust as a computational concept (agents that keep a history of behaviors)
  • Trust in peer-to-peer: EigenTrust [Kamvar et al., 2004] (corrupt vs. valid files)
  • Various efforts in the formalization of trust in recommender systems and social networks [Swearingen and Sinha, 2001], [Ziegler and Golbeck, 2006]
    – The more similar two people were, the greater the trust between them [Ziegler and Golbeck, 2006]
    – “Trust in a person is a commitment to an action based on a belief that the future actions of that person will lead to a good outcome.” Example: “Alice trusts Bob regarding email if she chooses to read a message (commits to an action) that Bob sends her (based on her belief that Bob will not waste her time)”


Tagging

  • Social media sites allow users to tag the data
  • User tags act as manual indexing of data – in addition to automatic indexing
  • User tags serve as a folksonomy
  • Tags are used to organize and search data
  • Challenges with tagged data:
    – Vocabulary mismatch
    – Noisy or spam tags
    – Missing tags

Searching Tagged Data: Vocabulary Mismatch Problem

  • Tag keywords describe textual or non-textual data and are used to search for items
    – Tags are very sparse (only a few keywords)
    – Boolean (conjunctive, disjunctive) search can lead to high precision/low recall or high recall/low precision
    – To reduce the vocabulary mismatch, perform stemming or pseudo-relevance feedback


Searching Tagged Data: Noisy and Spam Tags

  • Spam/misspelled/non-relevant tags mislead search
    – Some incentive must be provided to users to report spam tags, and to enter good-quality tags
    – Log and statistical information may help to identify spam tags

Searching Tagged Data: Missing Tags

  • Automatically generate tags for items with missing tags, using:
    – Term weights of the textual representation of the item
    – Classification of the item to a label (i.e., a tag)


Tag Clouds

  • The most popular tags are presented to users to provide a wider view of the collection
  • A tag cloud displays the tags as a weighted list
  • The font size is proportional to the weight

Thanks to: www.wordle.net for the tag cloud generator & F. Silvestri, CNR, Italy, S. Orlando, U. of Venice, Italy
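Weight-proportional font sizing can be sketched by scaling each tag's frequency linearly between a chosen minimum and maximum pixel size. The tag counts and size bounds are illustrative assumptions.

```python
# Hypothetical tag frequencies.
tags = {"search": 40, "web": 25, "index": 10, "snippet": 10}

def font_sizes(tags, min_px=12, max_px=36):
    lo, hi = min(tags.values()), max(tags.values())
    span = (hi - lo) or 1          # avoid division by zero if all counts equal
    return {
        tag: round(min_px + (count - lo) * (max_px - min_px) / span)
        for tag, count in tags.items()
    }

print(font_sizes(tags))  # most frequent tag gets the largest font
```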

Recommender Systems