SLIDE 1

Internet Engineering: Search

Ali Kamandi

Sharif University of Technology

kamandi@ce.sharif.edu

Spring 2007

SLIDE 2

Statistics

In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web-accessible documents.

As of September 2005, Google indexed 8,200,000,000 web pages.

SLIDE 3

Search Engines

A search engine is a system that collects and organizes Web documents and presents a way to select them based on certain words, phrases, or patterns within the documents.

Model the Web as a full-text database.

Index a portion of the Web documents.

Search Web documents using user-specified words or patterns in the text.

SLIDE 4

Search Engines

Two categories of search engine:

General-purpose search engines, e.g. Yahoo!, AltaVista, and Google.

Special-purpose search engines (or Internet portals), e.g. LinuxStart (www.linuxstart.com).

SLIDE 5

Search Engines

Two main components of a search engine:

A web crawler (spider), which collects a massive number of Web pages.

A large database, which stores and indexes the collected Web pages.

Ranking has to be performed without accessing the text, using just the index.

Ranking algorithms: all information is “top secret.” It is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries.

SLIDE 6

What’s Wrong with SQL (Search Quality) (1)

select * from content where body like '%running%'

select * from content where upper(body) like upper('%running%')

SLIDE 7

What’s Wrong with SQL (Search Quality) (2)

select * from content where upper(body) like upper('%running shoes%')

select * from content where upper(body) like upper('%running%') and upper(body) like upper('%shoes%')

SLIDE 8

What’s Wrong with SQL (Search Quality) (3)

With AND, the more the user tells us about her interests, the fewer documents we’ll return in response to a search.

Note that public search engines circa 2005, such as Google, Yahoo, A9, and MSN, do implicitly use AND.

If there aren’t any rows with all query terms, we should probably offer the user rows that contain some of the query terms.
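The fallback described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the in-memory `content` table, its two sample rows, and the `search` helper are assumptions made for the example, not part of the original slides.

```python
import sqlite3

def search(conn, terms):
    """Return bodies containing all terms; if none do, fall back to any term."""
    if not terms:
        return []
    clauses = ["upper(body) like upper(?)" for _ in terms]
    params = [f"%{term}%" for term in terms]
    # Try the strict AND query first, then relax to OR if nothing matched.
    for joiner in (" and ", " or "):
        sql = "select body from content where " + joiner.join(clauses)
        rows = [row[0] for row in conn.execute(sql, params)]
        if rows:
            return rows
    return []

# Hypothetical sample data, used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("create table content (body text)")
conn.executemany(
    "insert into content values (?)",
    [("Running tips for beginners",), ("Dress shoes on sale",)],
)
```

No row contains both "running" and "shoes", so `search(conn, ["running", "shoes"])` falls back to rows that contain either term.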

SLIDE 9

Stemming

A search for ‘‘running’’ should also find documents such as:

‘‘My brother-in-law Billy Bob ran 20 miles yesterday’’

‘‘My cousin Gertrude runs 15 miles every day.’’

The fix is stemming both the query terms and the indexed terms.

‘‘running,’’ ‘‘runs,’’ and ‘‘ran’’ would all be bashed down to the stem word ‘‘run’’ for indexing and retrieval.
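A toy stemmer makes the idea concrete. This sketch is far cruder than real stemming algorithms: the suffix rules and the `IRREGULAR` table are assumptions chosen only to cover the examples on this slide.

```python
# Hypothetical lookup table for irregular forms that suffix rules cannot handle.
IRREGULAR = {"ran": "run"}

def stem(word):
    """Reduce a word to a crude stem (illustrative rules only)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        # Undo consonant doubling: "runn" becomes "run".
        if len(w) >= 2 and w[-1] == w[-2]:
            w = w[:-1]
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    return w

def stems(text):
    """Stem every whitespace-separated word in a query or document."""
    return {stem(token.strip(".,'\"")) for token in text.split()}
```

Applying `stems` to both the query and the indexed text lets a query for "running" match documents that only contain "ran" or "runs".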

SLIDE 10

Expanding Queries Through a Thesaurus

What about ‘‘I attended the 100th anniversary Boston Marathon’’?

Matching it requires expanding queries through a thesaurus powerful enough to make the connection between ‘‘running’’ and ‘‘marathon.’’
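Query expansion can be sketched as a dictionary lookup. The `THESAURUS` entries below are hypothetical and hand-written for this example; a real system would need a far larger thesaurus.

```python
import re

# Hypothetical thesaurus: each query term maps to related terms.
THESAURUS = {"running": {"marathon", "jogging"}}

def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus any related terms from the thesaurus."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        expanded |= thesaurus.get(term, set())
    return expanded

def matches(document, terms):
    """True if the document contains any term of the expanded query."""
    words = set(re.findall(r"[a-z0-9]+", document.lower()))
    return bool(words & expand_query(terms))
```

With this expansion, a query for "running" matches the Boston Marathon sentence even though the word "running" never appears in it.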

SLIDE 11

What’s Wrong with SQL (Performance) (1)

select * from content where body like '%running%'

The RDBMS must examine every row in the content table to answer this query: a full table scan, which takes O[N] time, where N is the number of rows in the table.

SLIDE 12

What’s Wrong with SQL (Performance) (2)

Suppose that a standard RDBMS index is defined on the body column. The values of body will be used as keys for a B-tree, and we could perform

select * from content where body = 'running'

and maybe, depending on the implementation,

select * from content where body like 'running%'

in O[log N] time.
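A sorted list searched with binary search behaves like a B-tree for this purpose, so it can illustrate why exact and prefix matches are cheap while substring matches are not. The sample keys are assumptions made for the sketch.

```python
import bisect

def exact_match(sorted_keys, key):
    """Binary search for an exact key: O(log N) comparisons."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

def prefix_matches(sorted_keys, prefix):
    """Locate the contiguous run of keys starting with prefix via two binary searches."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # Appending the largest code point gives an upper bound for all keys with this prefix.
    hi = bisect.bisect_left(sorted_keys, prefix + "\U0010ffff")
    return sorted_keys[lo:hi]

# Hypothetical values of the body column, kept in sorted order like index keys.
keys = sorted(["running", "running shoes", "trail running", "walking"])
```

Here `prefix_matches(keys, "running")` finds "running" and "running shoes" but not "trail running". Keys that merely contain the word are scattered through the sort order, which is why a query with a leading wildcard cannot use the index.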

SLIDE 13

Abandoning the RDBMS

We can solve both the performance and search quality problems by dumping all of our data into a full-text search system.

These systems index every word in a document, not just the first words as with the standard RDBMS B-tree.

A full-text index can answer the question ‘‘Find me the documents containing the word ‘running’ ’’ in time that approaches O[1], i.e., constant time.
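One common way to build such an index is an inverted index: a hash table from each word to the set of documents that contain it. The class below is a minimal sketch under that assumption. The slides do not say which structure a particular product uses, and the class and method names are made up for the example.

```python
import re
from collections import defaultdict

class InvertedIndex:
    """Maps each word to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index every distinct word of a document; return how many entries were touched."""
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for word in words:
            self.postings[word].add(doc_id)
        return len(words)

    def lookup(self, word):
        """One hash lookup, independent of how many documents are indexed."""
        return self.postings.get(word.lower(), set())

index = InvertedIndex()
index.add(1, "My brother-in-law Billy Bob ran 20 miles yesterday")
index.add(2, "My cousin Gertrude runs 15 miles every day.")
```

`index.lookup("miles")` returns both document ids with a single dictionary access, while `add` has to touch one entry per distinct word in the new document.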

SLIDE 14

Indexing

SLIDE 15

Constant Time

If there are 10 million documents in the corpus, a search through those 10 million documents will not take much longer than a search through a corpus of 1,000 documents.

Getting close to constant time in this situation would require that

the 10-million-document collection did not use a larger vocabulary than the 1,000-document collection,

and that it was not the case that, say, 90 percent of the documents contained the word ‘‘running.’’

SLIDE 16

Stopwords

A query for a word like “the” matches nearly every document, so answering it takes O(N) time.

Stopwords are words that are too common to be worth indexing.

For standard English, the stopword list includes such words as ‘‘a,’’ ‘‘and,’’ ‘‘as,’’ ‘‘at,’’ ‘‘for,’’ ‘‘or,’’ ‘‘the,’’ and so forth.
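Filtering stopwords before indexing can be sketched in a few lines. The `STOPWORDS` set holds only the words listed on this slide; a real list would be longer, and the function name is an assumption.

```python
import re

# Only the stopwords named on the slide; real lists are longer.
STOPWORDS = frozenset({"a", "and", "as", "at", "for", "or", "the"})

def index_terms(text, stopwords=STOPWORDS):
    """Return the words of text that are worth indexing, in order."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in stopwords]
```

The same filter would be applied to queries, so a search for "the shoes" only looks up "shoes".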

SLIDE 17

Cost

Inserting a new document into the collection will be slow. We’ll have to go through the document, word by word, and update as many rows in the index as there are distinct words in the document.

SLIDE 18

Word-Frequency Histogram

Suppose that there are 1,000 documents in the collection containing ‘‘running shoes’’. Which are the most relevant to the user’s query of ‘‘running shoes’’?

We need a new data structure: the word-frequency histogram.

It records which words occur in a document and how frequently they occur.

SLIDE 19

Example

The first sentence of Tolstoy’s Anna Karenina:

‘‘All happy families resemble one another, but each unhappy family is unhappy in its own way.’’
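A crude histogram for this sentence can be built with Python's collections.Counter. The tokenization rule (lowercase runs of letters) is an assumption made for the sketch.

```python
import re
from collections import Counter

SENTENCE = ("All happy families resemble one another, but each unhappy "
            "family is unhappy in its own way.")

def histogram(text):
    """Count how often each lowercase word occurs in text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

hist = histogram(SENTENCE)
```

In the result, "unhappy" has a count of 2 and every other word a count of 1. Note that "families" and "family" are still counted as separate words at this stage.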

SLIDE 20

More Refinements

After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of ‘‘resemble’’ is more interesting than ‘‘happy’’ because ‘‘resemble’’ occurs less frequently in standard English.

Stopwords such as ‘‘is’’ are thrown away altogether.

Stemming is another useful refinement.

In the index and in queries, ‘‘families’’ and ‘‘family’’ are treated as the same word.
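The prevalence adjustment can be sketched by down-weighting words that are common across a collection. Note the substitution: the slide adjusts against word frequencies in standard English, while this sketch uses smoothed inverse document frequency over a tiny made-up corpus, because no English frequency table is given in the slides. The corpus and function names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def weighted_histogram(doc, corpus):
    """Scale each word count by log((1 + N) / (1 + number of docs containing the word))."""
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    n = len(corpus)
    counts = Counter(tokenize(doc))
    return {word: count * math.log((1 + n) / (1 + doc_freq[word]))
            for word, count in counts.items()}

# Hypothetical three-document corpus, for illustration only.
CORPUS = [
    "All happy families resemble one another",
    "Happy people go running",
    "Happy families buy running shoes",
]
weights = weighted_histogram(CORPUS[0], CORPUS)
```

Because "happy" appears in every document of this toy corpus, its weight drops to zero, while the rarer "resemble" keeps the highest weight.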

SLIDE 21

Inter-Document Similarity

Given a body of histograms it is possible to answer queries such as ‘‘Show me documents that are similar to this one’’ or ‘‘Show me documents whose histogram is closest to a user-entered string.’’

The inter-document similarity query can be handled by comparing histograms already stored in the text database.
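The slide does not say how ‘‘closest’’ is measured. One common choice, assumed here, is cosine similarity between word-count vectors. The helper names and sample documents are made up for the sketch.

```python
import math
import re
from collections import Counter

def histogram(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(h1, h2):
    """Cosine of the angle between two word-count vectors (1.0 means same direction)."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def most_similar(query, documents):
    """Return the document whose histogram is closest to the query string."""
    q = histogram(query)
    return max(documents, key=lambda d: cosine_similarity(q, histogram(d)))

# Hypothetical documents, for illustration only.
DOCS = ["My cousin Gertrude runs 15 miles every day",
        "Dress shoes on sale today"]
```

In practice the histograms would be computed once at indexing time and stored, so only the query string needs to be tokenized at search time.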

SLIDE 22

Working with the Public Search Engines

First, Google has to know about your server. This happens either when someone already in the Google index links to your site or when you manually add your URL from a form off the google.com home page.

Second, Google has to be able to read the text on your server. At least as of 2005, none of the public search engines implemented optical character recognition (OCR). This means that text embedded in a GIF, Flash animation, or a Java applet won’t be indexed.

Third, Google has to be able to get into all the pages on your server. Pages requiring registration won’t be indexed by Google unless your software is smart enough to recognize that it is Google behind the request and make an exception.

SLIDE 23

Prevent Search Engine from Archiving

Some search engines archive what they index.

To prevent search engines from archiving the page, a robots META tag such as <meta name="robots" content="noarchive"> can be placed in the HEAD of your HTML documents.

Robots are not guaranteed to follow such directives.

SLIDE 24

Add Extra Keywords

In the online table of contents page for this book, we have the following META tags in the HEAD:

<meta name="keywords" content="web development online communities MIT 6.171 textbook">

<meta name="description" content="This is the textbook for the MIT course Software Engineering for Internet Applications">

The ‘‘keywords’’ tag adds some words that are relevant to the document, but not present in the visible text.

SLIDE 25

Robots.txt

The Standard for Robot Exclusion is a protocol for communication between Web publishers and Web crawlers: http://www.robotstxt.org/wc/norobots.html.

The publisher places a file at /robots.txt, with instructions for robots. Example:

User-agent: *
# let’s keep the robots away from our half-baked stuff
Disallow: /staging
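From the crawler's side, the rules above can be checked with Python's standard urllib.robotparser module. This is a sketch: the agent name `ExampleBot` and the `allowed` helper are assumptions, and the rules are parsed from an inline string rather than fetched from a server.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
# let's keep the robots away from our half-baked stuff
Disallow: /staging
"""

def allowed(path, agent="ExampleBot"):
    """Return True if the given agent may fetch path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)
```

A well-behaved crawler would call something like `allowed("/staging/draft.html")` before each request and skip the page when the result is false. Nothing forces a robot to perform this check.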

SLIDE 27

Mobile Users

Mobile Internet devices put an even greater stress on information retrieval.

Connection speeds are slower.

Screens are smaller.

It isn’t practical for a user to drill down into 20 documents returned by a search engine as possibly relevant to a query, especially if the user is driving a car and using a voice browser.

SLIDE 28

Hint

Generally users prefer to browse rather than search. If users are resorting to searches in order to get standard answers or perform common tasks, there may be something wrong with a site’s navigation or information architecture.

SLIDE 29

A split-system approach to providing full-text search.

SLIDE 30

Extra Features

Who Links To You?

link:siteURL, for example link:www.google.com

Web Page Translation

Google offers the following translation pairs: English to and from Arabic, Chinese, French, German, Italian, Korean, Japanese, Spanish, and Portuguese; and German to and from French.

SLIDE 31

Extra Features

Spell Checker

Site Search: restrict your search to a specific site, for example admission site:www.sharif.edu

Similar Pages

Results Prefetching

Image, Music and Movie Search

SLIDE 32

Into the GYM

Google

Still the most popular search engine

Yahoo!

Gaining on Google

MSN (now Live Search)

Love them or hate them, it’s Microsoft

SLIDE 33

Important Alternatives

Ask (Jeeves retired)

Now owned by Barry Diller (IAC)

Exalead

Developed in France, now has US office

A9

Part of Amazon

Gigablast

Added blog search

SLIDE 34

Whatever Happened To...

AltaVista

AllTheWeb

WiseNut

Teoma

Northern Light

SLIDE 35

Size Wars

August 2005: ‘‘We index 20 billion documents.’’

September 2005: ‘‘We index 8 billion documents, but our index is 3 times larger than our competition’s.’’

So, who’s right?

SLIDE 36

Share Of Searches: July 2006

5.6 billion searches in this month

SLIDE 37

How has the share of searches changed?

SLIDE 38

Google

Google's founders are Larry Page and Sergey Brin.

"Googol" is the mathematical term for a 1 followed by 100 zeros.

Google's play on the term reflects the company's mission to organize the immense amount of information available on the web.

SLIDE 39

References

Chapter 12 of Software Engineering for Internet Applications by Eve Andersson, Philip Greenspun, and Andrew Grumet; revised February 2005.

NetRatings for searchenginewatch.com

SLIDE 40
