
Internet Search

(COSC 488) Nazli Goharian

nazli@cs.georgetown.edu

Nazli Goharian, 2005, 2012

Outline

  • Web: Indexing & Efficiency
    – Partitioned Indexing
    – Index Tiering & other early termination techniques
    – Index in Dynamic Environment
  • Improving effectiveness of Web search engines
    – Web page ranking
      • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
    – Result snippets
  • Social Search
    – Tagging, collaborative search/filtering, recommender systems
    – Real-time search
  • Peer-to-Peer Search


The Web

  • Document collections are scattered across many geographical areas.
  • Constraints prohibiting the centralization of data include:
    – Data security
    – Volume
    – Rate of change
    – Political and legal constraints
    – Other proprietary motivations

Web Search

  • Parallel and distributed processing
  • Web search tools access data distributed on servers worldwide but indexed centrally.
  • Most of these systems have a partitioned index on large clusters of servers with centralized control.
  • They store pointers in the form of hypertext links to various Web servers.


Partitioned Indexing

  • Partitioning of the index across multiple machines, based on either:
  • Terms (Global index organization)
    – Each node holds the posting lists for some terms
    – Using a content index, query terms are sent to the nodes holding those terms
    – Higher concurrency level, but longer posting lists
  • Documents (Local index organization) – more common
    – Each node holds a complete index for its documents (shorter posting lists)
    – Query terms are sent to all nodes
    – Top-k results from each node are merged
    – Global statistics (e.g., idf) must be calculated
  • A hybrid approach in Tiered Indexing may be used
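The document-partitioned (local) organization can be sketched as follows: each node scores the query against only its own shard and returns its top k, and a broker merges the partial lists. This is a minimal sketch; the shard contents and the toy scoring function are illustrative assumptions, not a real ranking model.

```python
import heapq

# Hypothetical per-node shards: each node indexes its own documents.
shards = [
    {"d1": "web search engine", "d2": "index tiering"},
    {"d3": "search engine ranking", "d4": "page rank"},
]

def score(text, query_terms):
    # Toy query-dependent score: count of matching query terms.
    return sum(text.split().count(t) for t in query_terms)

def node_top_k(shard, query_terms, k):
    # Each node computes its own local top-k.
    scored = [(score(text, query_terms), doc) for doc, text in shard.items()]
    return heapq.nlargest(k, scored)

def broker_search(query, k=2):
    terms = query.split()
    # Broadcast the query to every node, then merge the partial top-k lists.
    partials = [node_top_k(s, terms, k) for s in shards]
    return heapq.nlargest(k, [hit for p in partials for hit in p])

print(broker_search("search engine"))
```

Note that a real system would also have to share global statistics (such as idf) across nodes before scoring, as the slide points out.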

Index Tiering

  • A popular early termination technique to improve the efficiency of query processing
  • Divides nodes into two tiers, allocating the index of the most popular documents to tier 1 and the rest to tier 2
  • Search tier 1 first; if there are not enough results, then search tier 2
  • Note: other popular early termination techniques (top-doc and query pruning) were discussed earlier in the semester!
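The tiered evaluation above can be sketched as: query tier 1, and fall through to tier 2 only when tier 1 cannot fill the result list. A minimal sketch; the tier contents and the exact-match rule are illustrative assumptions.

```python
# Tier 1 holds the most popular documents; tier 2 holds the rest.
tier1 = {"d1": "web search", "d2": "search engine index"}
tier2 = {"d3": "search tiering", "d4": "query pruning"}

def match(shard, term):
    # Toy matching: return documents containing the term.
    return [doc for doc, text in shard.items() if term in text.split()]

def tiered_search(term, k):
    results = match(tier1, term)
    if len(results) < k:          # not enough results: search tier 2 as well
        results += match(tier2, term)
    return results[:k]

print(tiered_search("search", 2))  # tier 1 alone satisfies k
print(tiered_search("query", 1))   # falls through to tier 2
```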



Distributed Index Construction

  • Not possible on a single machine
  • Various architectures exist for distributed indexing
  • MapReduce architecture (a term-partitioned index)
  • Master node assigns tasks to worker nodes (map workers & reduce workers) to split up the computing jobs:
    – Map phase: parsing & building localized <term, doc> pairs
    – Reduce phase: combining/merging posting pairs for each term


MapReduce (Cont’d)

  • Map & reduce phases can be done in parallel on many machines
  • A map machine can serve as a reduce machine in the process
  • Data is broken into pieces (shards), generally 16–64 MB [128 MB], and sent to map workers as they finish their jobs
  • Map workers work on one shard at a time (generally, unless they have more than one CPU); they parse and generate <term, doc> pairs (which can be combined into <term, doc, tf>)
  • Pairs are sorted on term, and then on a secondary key (doc_id)
  • The same keys (terms) are assigned to the same reduce worker
  • Load should be balanced across the reducers
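The map and reduce phases above can be sketched in a single process: map parses each document into (term, doc) pairs; reduce sorts the pairs on term (with doc_id as the secondary key) and merges them into posting lists. A minimal sketch; the three documents are an illustrative assumption, and real MapReduce distributes shards and key ranges across machines.

```python
from collections import defaultdict

docs = {1: "web search engine", 2: "search index", 3: "web index"}

def map_phase(doc_id, text):
    # Parse a document into (term, doc_id) pairs.
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    # Merge the pairs for each term into a sorted posting list.
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):   # sort key: term, then doc_id
        postings[term].append(doc_id)
    return dict(postings)

pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["index"])   # posting list for the term "index"
```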


MapReduce (Cont’d)

Taken from: C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008

Query Servers

  • Each server has its own disk holding a portion of the index
  • Queries are distributed, via a centralized control, to the servers that contain the related posting lists
  • Common terms may map to many servers
  • No single point of resource contention (efficient)
  • If a server crashes, that portion of the index is not available


Index in Dynamic Environment

  • The data collection is not static
  • Reconstruct the index periodically from scratch (many search engines use this)
  • Maintain an auxiliary index to store new documents & re-merge with the existing index
  • Maintain multiple indexes – complicated in maintaining collection statistics
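The auxiliary-index strategy can be sketched as: new documents go into a small auxiliary index, search consults both indexes, and a re-merge periodically folds the auxiliary postings into the main index. A minimal sketch under illustrative assumptions (in-memory dicts stand in for on-disk index structures).

```python
main_index = {"web": [1, 2], "search": [2]}
aux_index = {}

def add_document(doc_id, text):
    # New documents only touch the small auxiliary index.
    for term in set(text.split()):
        aux_index.setdefault(term, []).append(doc_id)

def search(term):
    # Until the next re-merge, results come from both indexes.
    return main_index.get(term, []) + aux_index.get(term, [])

def remerge():
    # Fold auxiliary postings into the main index, then clear the auxiliary.
    for term, postings in aux_index.items():
        main_index[term] = sorted(set(main_index.get(term, []) + postings))
    aux_index.clear()

add_document(3, "web search")
print(search("web"))   # main + auxiliary postings
remerge()
print(search("web"))   # now entirely from the main index
```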


Outline

  • Web: Indexing & Efficiency
    – Partitioned Indexing
    – Index Tiering & other early termination techniques
    – Index in Dynamic Environment
  • Improving effectiveness of Web search engines
    – Web page ranking
      • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
    – Result snippets
  • Social Search
    – Tagging, collaborative search/filtering, recommender systems
    – Real-time search
  • Peer-to-Peer Search


Definitions

  • Web graph: each page is a node and links are directed edges from one node to another
  • Out-links (out-degree) of A: links from page A to other pages
  • In-links (in-degree) of A: links from other pages to A
  • Sink: a page with no out-links (out-degree = 0)
  • Source: a page with no in-links (in-degree = 0)
  • Static page: a page generated prior to any request
  • Dynamic page: a page generated as the result of a request
  • Hidden/deep Web: pages with no links, password protected, reachable only via a form, …
  • Indexable Web: union of the pages indexed by major search engines

Evaluation of Web Search Engines: High Precision Search

  • Traditional IR systems are evaluated based on precision and recall.
  • Web search engines are evaluated based on the top N documents.
  • Recall estimation is very difficult.
  • Precision is of limited concern, as many users do not look beyond the first screen. => How fast and accurately is the first results screen generated?


Web Page Ranking

  • Considering both query-dependent and query-independent scores (captured during indexing), a global score is generated for each page:
  • Query-dependent score
    – Similarity measures such as Cosine, BM25, proximity, …
  • Query-independent score
    – Link analysis (anchor text, popularity metrics such as authorities and hubs, page rank, …)
    – Sponsored search
    – Localized search
    – Query log analysis
    – etc.

Query Log Analysis

  • Using user query patterns on certain days and at certain times of the day, week, month, and year, many optimizations are possible:
    – Pre-cache likely Web pages in anticipation of user queries to reduce page access delays, increasing system throughput (efficiency optimization)
    – Adjust relevance ranking to tune for certain user queries (accuracy optimization)


Anchor Text

  • Short, 2-3 terms, describing the linked/destination page
  • May or may not reflect a different point of view than the page author’s
  • Anchor text of links to a document di is included in the index for di
  • Extended anchor text (the text surrounding the anchor text) may also be used
  • Generally weighted based on frequency (notion of idf)
  • Spamming problem

Page Rank

  • A scoring mechanism in Web search (trademarked by Google and patented by Stanford)
  • Generally calculated at the time of crawling
  • Uses incoming and outgoing links as an indicator of popularity to adjust a Web page’s score
  • A popular page is defined as a page such that:
    – Many Web pages link to it (in-links)
    – Important (popular) pages link to it
  • May be affected by link spam

Page Rank

  PageRank(A) = (1 − d)/N + d · ( PageRank(D1)/C(D1) + … + PageRank(Dn)/C(Dn) )

  C(Di): number of links out from page Di
  d: damping factor (from 0 to 1; commonly 0.85)
  N: total number of pages

An iterative algorithm:

  Initially all pages are assigned an arbitrary page rank (1/N), summing to 1.
  Iteratively recalculate the scores until the new scores do not change significantly.
  To converge faster, page ranks may be initialized based on the number of in-links, log info, …
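The iterative algorithm above can be sketched in Python. The three-page graph is an illustrative assumption; a fixed iteration count stands in for a proper convergence test.

```python
# Tiny hypothetical Web graph: adjacency lists of out-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(graph, d=0.85, iterations=50):
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}        # arbitrary start: 1/N each
    for _ in range(iterations):
        # (1 - d)/N term, then add each page's shared-out rank.
        new_ranks = {page: (1 - d) / n for page in graph}
        for page, out_links in graph.items():
            share = d * ranks[page] / len(out_links)  # C(Di) = out-degree
            for target in out_links:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

ranks = pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

With these links, C collects rank from both A and B and ends up with the highest score, and the scores still sum to 1.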

Authorities and Hubs

  • Various algorithms are based on assigning each retrieved Web page two scores: an authority score and a hub score (HITS: Hyperlink-Induced Topic Search, 1999)
  • Authority page: an authoritative source on a given topic
  • Hub page: a page listing pointers to authority pages on a topic
  • Authority score: summation of the scores of all the hubs pointing to that authority page
  • Hub score: summation of the scores of all the authority pages the hub points to


Computing Authority and Hub Scores

  • Retrieve all pages containing the query term t. This is called the root set (~200 pages).
  • Create a set including the union of the root set pages, the pages that point to root set pages, and the pages that the root set pages point to. This is called the base set.
  • Use the base set to compute the hub and authority scores.
  • An iterative algorithm:
    – Initialize hubs and authorities with a score of 1
    – Update s(H) and s(A) until the scores converge
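The iterative HITS update can be sketched as follows: each authority score is the sum of the hub scores of the pages pointing to it, each hub score is the sum of the authority scores it points to, and both vectors are normalized every round so they do not grow without bound. A minimal sketch; the four-page base set is an illustrative assumption.

```python
import math

# Hypothetical base set: page -> pages it points to.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}

def hits(links, iterations=20):
    hubs = {p: 1.0 for p in links}
    auths = {p: 1.0 for p in links}
    for _ in range(iterations):
        # Authority score: sum of the hub scores pointing to the page.
        for p in auths:
            auths[p] = sum(hubs[q] for q in links if p in links[q])
        # Hub score: sum of the authority scores the page points to.
        for p in hubs:
            hubs[p] = sum(auths[q] for q in links[p])
        # Normalize both score vectors.
        for scores in (hubs, auths):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hubs, auths

hubs, auths = hits(links)
print(max(auths, key=auths.get))  # a1: pointed to by both hubs
```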


Sponsored Search

  • Search system vendors sell advertisers keywords, so that whenever such words are issued in a query, the advertiser’s desired homepage link is returned.
  • Sponsored search results are biased towards advertisers with higher bids, the click frequency of ads, …
  • Significant revenue is generated for search engine vendors via such a search approach (e.g., per click: 50 cents to 15 dollars)


Sponsored Search

  • Search engines maintain an advertisement database (description of the advertisement, link to that page, bids, popularity, …)
  • Searching the advertisement database for a match to:
    – query terms
    – keywords extracted from the retrieved result page (pseudo-relevance feedback, page features, …)
  • Ranking advertisements based on bids (on keywords) and advertisement popularity (using clickthrough data logs)

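Ranking ads by bid and popularity can be sketched as scoring each matching ad by bid times observed clickthrough rate (a common expected-revenue heuristic, used here as an assumption; the ad database is illustrative).

```python
# Hypothetical advertisement database: keywords, bid, and clickthrough rate.
ads = [
    {"ad": "shoes-sale", "keywords": {"shoes"}, "bid": 0.50, "ctr": 0.10},
    {"ad": "run-gear", "keywords": {"shoes", "running"}, "bid": 1.20, "ctr": 0.02},
    {"ad": "book-store", "keywords": {"books"}, "bid": 2.00, "ctr": 0.05},
]

def rank_ads(query_terms):
    # Match on sold keywords, then rank by expected revenue per impression.
    matches = [a for a in ads if a["keywords"] & set(query_terms)]
    return sorted(matches, key=lambda a: a["bid"] * a["ctr"], reverse=True)

for ad in rank_ads(["running", "shoes"]):
    print(ad["ad"], ad["bid"] * ad["ctr"])
```

Note that the lower bidder can still win the slot when its clickthrough rate is high enough, which matches the slide's point that ranking uses both bids and popularity.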

Localized Search

  • Using geographic information to modify the ranking of results (in addition to SC scores, link-based scores, …)
  • Geographic information may be derived from:
    – Location of the device sending the query
    – Context of the query
      • restaurant near Al Capone’s home town
      • restaurant near White Sox stadium
    – Geographic location in the query
      • Chicago restaurants
    – Geographic location in document metadata


Result Snippets

  • Providing users a short summary (snippet) of a page (title, URL, link to cached page, snippet)
  • Static snippets
    – Query independent
    – Created at indexing time and cached
    – Containing the title and n sentences/words (NLP can be used)
  • Dynamic snippets
    – Query dependent
    – Created at the time of results scoring
    – “Windows” of the document – also called KWIC (keyword in context)

Result Snippets

  • Index maintains sentence-level information
  • Snippet sentences can be picked:
    – Based on query term(s):
      • Heading
      • Location in document (nth sentence)
      • Closeness of query terms in the sentence
      • Ratio of query terms in the sentence
      • Unique query terms in the sentence
    – From page metadata
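Dynamic snippet selection along the lines above can be sketched as: split the page into sentences, then pick the sentence that scores best on unique query terms, ratio of query terms, and closeness of the terms. A minimal sketch; the document text and the score weights are illustrative assumptions.

```python
import re

def pick_snippet(text, query_terms):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    terms = {t.lower() for t in query_terms}

    def score(sentence):
        words = sentence.lower().split()
        positions = [i for i, w in enumerate(words) if w.strip(".,!?") in terms]
        if not positions:
            return 0.0
        unique = len({words[i].strip(".,!?") for i in positions})
        span = positions[-1] - positions[0] + 1          # closeness of terms
        return unique + len(positions) / len(words) + 1.0 / span

    return max(sentences, key=score)

doc = ("Web search is hard. Snippets summarize pages. "
       "Query terms guide snippet selection.")
print(pick_snippet(doc, ["snippet", "query"]))
```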


Result Snippets

  • An effective snippet should: (Clarke et al. 2007’s clickthrough analysis)
    – Have all the query terms (unless already included in the title)
    – Use the page metadata, if needed
    – Display the URL and mark the query terms
    – Provide meaningful snippets vs. only some keywords

Outline

  • Web: Indexing & Efficiency
    – Partitioned Indexing
    – Index Tiering & other early termination techniques
    – Index in Dynamic Environment
  • Improving effectiveness of Web search engines
    – Web page ranking
      • Query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
    – Result snippets
  • Social Search
    – Tagging, collaborative search/filtering, recommender systems
    – Real-time search
  • Peer-to-Peer Search


Social Search

  • Social search introduces new aspects to search engines
    – Village paradigm (collaborative) [Horowitz & Kamvar, WWW’10]
      • Crowd / social network / friends vs. corpus-based
      • Routing questions to potential answerers
  • Community of users, sharing a goal or interest, participate in search and interact with each other online
    – YouTube, Twitter, Flickr, Facebook, Myspace, LinkedIn, forums, blogs, online games, …

From Wikipedia: “Social search or a social search engine is a type of web search that takes into account the Social Graph of the person initiating the search query. When applied to web search this Social-Graph approach to relevance is in contrast to established algorithmic or machine-based approaches where relevance is determined by analyzing the text of each document or the link structure of the documents”

Real-Time Search

  • Traditional search indexes the crawled pages
  • Real-time search results of search engines such as Google, Bing, Yahoo come from a variety of real-time search services such as Twitter, Flickr, YouTube, etc.
    – Receive data directly from various social media and blogs (subscribed to social networking sites)
    – A filtering engine identifies spam
    – Measuring relevance – the ranking is based on:
      • time,
      • relevance to the query,
      • number of followers of the authors,
      • reputation of a link, defined by the frequency of forwarding (re-tweets), …
  • First real-time search: “Summize” in 2007, with real-time trend analysis (later merged with Twitter in 2008)


Social Search

  • Documents or websites are deemed relevant if the searcher’s social network was also interested in them
  • Nature of queries
    – In many cases opinionated, subjective
    – Query length (in many cases longer queries than the Web’s)
  • Index
    – Storing users’ behavior (responsiveness, answer quality, expertise)
    – Mapping users to topics

Social Search

  • Social search ranking – based on a combination of:
    – Query-dependent (probability of a good answer to query q by user u)
      • Similarity of results to the query (various rankings: cosine, BM25, proximity, …)
      • Relatedness of query/results to the user
    – Query-independent
      • How many users bookmarked x
      • Social trust
      • Similarity of asker to answerer – user profile similarity, users’ connectedness


Social Search

  • Mapping users to topics. An example: [Horowitz & Kamvar, WWW’10]
    – User specifies interest/expertise in topics
    – Friends of user u indicate the expertise of user u in topics
    – Automatically identified topics from:
      • User’s existing online profiles
      • User’s homepages, blogs
      • User’s status messages (Twitter, Facebook, IM, …)

Social Search

  • Measuring connectedness using cosine similarity over various features, such as: [Horowitz & Kamvar, WWW’10]
    – Social connection (common friends and affiliations)
    – Demographic similarity
    – Profile similarity (e.g., common favorite movies)
    – Vocabulary match (e.g., IM shortcuts)
    – Chattiness match (frequency of follow-up messages)
    – Verbosity match (the average length of messages)
    – Politeness match (e.g., use of “Thanks!”)
    – Speed match (responsiveness to other users)
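Cosine similarity over user feature vectors can be sketched as follows; the feature names and values are illustrative assumptions standing in for the features listed above.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse feature vectors (dicts).
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

alice = {"common_friends": 3.0, "chattiness": 0.8, "verbosity": 0.5}
bob   = {"common_friends": 2.0, "chattiness": 0.9, "verbosity": 0.4}
carol = {"common_friends": 0.0, "chattiness": 0.1, "verbosity": 2.0}

print(round(cosine(alice, bob), 3))    # near 1: similar feature profiles
print(round(cosine(alice, carol), 3))  # much lower: dissimilar profiles
```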


Social Search

Sample approach: [Karweg et al., CIKM’11]:

  • Social Relevance Score (SRS) ranks the result elements of a query according to their social relevance for the user. It is calculated based on 2 factors:
    – Engagement Intensity: how intensely the users interacted with the result
      • Engagement: interaction in terms of recommendations, ratings, status messages
      • Intensity: effort of textual feedback vs. a rating score / thumbs up
    – Trust Score: level of trust in those who recommend a link
      • Assigned by users & refined by social network analysis, using page rank on the social graph

  SRS(i) = Σ_{x ∈ E_i} e_i(x) · t(x)

  • SRS(i): social rank score of document/page i
  • x: a user in the social network who interacted with/recommended page i
  • E_i: the set of such users; e_i(x): engagement intensity of user x with page i; t(x): trust score of user x
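The SRS computation above can be sketched as a sum, over the users who interacted with a page, of engagement intensity times trust score. A minimal sketch; all user names, engagement values, and trust values are illustrative assumptions.

```python
# page -> {user: engagement intensity e_i(x)}
interactions = {
    "page1": {"alice": 2.0, "bob": 1.0},
    "page2": {"carol": 3.0},
}
# t(x): trust score per user, assigned by users / refined by network analysis.
trust = {"alice": 0.5, "bob": 0.5, "carol": 0.25}

def srs(page):
    # SRS(i) = sum over users x in E_i of e_i(x) * t(x)
    return sum(e * trust[user] for user, e in interactions[page].items())

print(srs("page1"))  # 2.0*0.5 + 1.0*0.5 = 1.5
print(srs("page2"))  # 3.0*0.25 = 0.75
```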

Social Search -- Trust

  • Trust has been discussed for years in sociology and social psychology
  • [Marsh, Ph.D. dissertation, 1994] formalized trust as a computational concept (agents that keep a history of behaviors)
  • Trust in peer-to-peer: EigenTrust [Kamvar et al., 2004] (corrupt vs. valid files)
  • Various efforts in the formalization of trust in recommender systems and social networks [Swearingen and Sinha, 2001], [Ziegler and Golbeck, 2006]
    – The more similar two people were, the greater the trust between them [Ziegler and Golbeck, 2006]
    – “Trust in a person is a commitment to an action based on a belief that the future actions of that person will lead to a good outcome.” Example: “Alice trusts Bob regarding email if she chooses to read a message (commits to an action) that Bob sends her (based on her belief that Bob will not waste her time)”


Tagging

  • Social media sites allow users to tag the data
  • User tags act as manual indexing of data – in addition to automatic indexing
  • User tags serve as a folksonomy
  • Tags are used to organize and search data
  • Challenges with tagged data:
    – Vocabulary mismatch
    – Noisy or spam tags
    – Missing tags

Searching Tagged Data: Vocabulary Mismatch Problem

  • Tag keywords describe textual or non-textual data and are used to search for items
    – Tags are very sparse (only a few keywords)
    – Boolean (conjunctive, disjunctive) search can lead to high precision/low recall or high recall/low precision
    – To reduce the vocabulary mismatch, perform stemming or pseudo-relevance feedback


Searching Tagged Data: Noisy and Spam Tags

  • Spam/misspelled/non-relevant tags mislead search
    – Some incentive must be provided to users to report spam tags, and to enter good-quality tags
    – Log and statistical information may help to identify spam tags

Searching Tagged Data: Missing Tags

  • Automatically generate tags for items with missing tags, using:
    – Term weights of the textual representation of the item
    – Classification of the item to a label (i.e., a tag)


Tag Clouds

  • The most popular tags are presented to users to provide a wider view of the collection
  • A tag cloud displays the tags as a weighted list
  • The font size is proportional to the weight

Thanks to: www.wordle.net for the tag cloud generator & F. Silvestri, CNR, Italy, S. Orlando, U. of Venice, Italy
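Weight-proportional font sizing can be sketched by scaling each tag's frequency linearly between a chosen minimum and maximum pixel size. The tag counts and size bounds are illustrative assumptions.

```python
# Hypothetical tag frequencies.
tags = {"search": 40, "web": 25, "index": 10, "snippet": 10}

def font_sizes(tags, min_px=12, max_px=36):
    lo, hi = min(tags.values()), max(tags.values())
    span = (hi - lo) or 1          # avoid division by zero if all counts equal
    return {
        tag: round(min_px + (count - lo) * (max_px - min_px) / span)
        for tag, count in tags.items()
    }

print(font_sizes(tags))  # most frequent tag gets the largest font
```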

Recommender Systems