Internet Engineering: Search
Ali Kamandi
Sharif University of Technology
kamandi@ce.sharif.edu
Spring 2007
2
Statistics
In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web-accessible documents.
As of September 2005, Google indexed 8,200,000,000 web pages.
3
Search Engines
A search engine is a system that collects, organizes, and presents a way to select Web documents based on certain words, phrases, or patterns within documents.
Model the Web as a full-text DB
Index a portion of the Web docs
Search Web documents using user-specified words/patterns in a text
4
Search Engines
Two categories of search engine:
general-purpose search engines, e.g. Yahoo!, AltaVista, and Google
special-purpose search engines (or Internet portals), e.g. LinuxStart (www.linuxstart.com)
5
Search Engines
Two main components of a search engine:
web crawler (spider), which collects massive numbers of Web pages
large database, which stores and indexes the collected Web pages
Ranking has to be performed without accessing the text, using just the index.
Ranking algorithms: all information is "top secret"; it is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries.
6
What’s Wrong with SQL (Search Quality) (1)
select * from content where body like '%running%'
select * from content where upper(body) like upper('%running%')
7
What’s Wrong with SQL (Search Quality) (2)
select * from content where upper(body) like upper('%running shoes%')
select * from content where upper(body) like upper('%running%') and upper(body) like upper('%shoes%')
8
What’s Wrong with SQL (Search Quality) (3)
With AND semantics, the more the user tells us about her interests, the fewer documents we'll return in response to a search.
Note that public search engines circa 2005, such as Google, Yahoo, A9, and MSN, do implicitly use AND.
If there aren't any rows with all query terms, we should probably offer the user rows that contain some of the query terms.
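The fallback described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `documents` dictionary, the `tokenize` helper, and the ranking by number of matched terms are hypothetical choices, not how any public search engine works.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def search(query, documents):
    """Return ids of documents containing all query terms (implicit AND).
    If no document has every term, fall back to documents containing
    some of the terms, best matches first."""
    terms = set(tokenize(query))
    # Count how many distinct query terms each document contains.
    scores = {}
    for doc_id, body in documents.items():
        matched = len(terms & set(tokenize(body)))
        if matched:
            scores[doc_id] = matched
    full_matches = [d for d, s in scores.items() if s == len(terms)]
    if full_matches:
        return sorted(full_matches)
    # No document has every term: rank partial matches by terms matched.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection.
documents = {
    1: "Running shoes on sale",
    2: "Shoes for running a marathon",
    3: "Dress shoes",
    4: "Running for office",
}

print(search("running shoes", documents))
print(search("waterproof running shoes", documents))
```

The first query is answered with AND alone; the second has no full match, so partial matches are offered instead of an empty result.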
9
Stemming
A search for "running" should probably also find:
"My brother-in-law Billy Bob ran 20 miles yesterday"
"My cousin Gertrude runs 15 miles every day."
The fix is stemming both the query terms and the indexed terms:
"running," "runs," and "ran" would all be bashed down to the stem word "run" for indexing and retrieval.
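A toy stemmer makes the idea concrete. The sketch below is a deliberately crude assumption, not a production algorithm such as Porter's: it handles one suffix rule each for "-ing" and "-s", and the `IRREGULAR` table is a hypothetical stand-in for the lookup a real stemmer would need.

```python
# Hypothetical table of irregular forms; a real stemmer needs far more.
IRREGULAR = {"ran": "run"}

def stem(word):
    """Reduce a word to a crude stem using two suffix rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ing") and len(word) >= 6:
        word = word[:-3]
        # Undo consonant doubling left behind: "runn" -> "run".
        if word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) >= 4:
        word = word[:-1]
    return word

for w in ("running", "runs", "ran"):
    print(w, "->", stem(w))
```

The same function must be applied to both the indexed words and the query words, otherwise the stems will not line up.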
10
expanding queries through a thesaurus
What about "I attended the 100th anniversary Boston Marathon"?
Finding it requires expanding queries through a thesaurus powerful enough to make the connection between "running" and "marathon."
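Query expansion can be sketched as a dictionary lookup. The `THESAURUS` contents below are a hypothetical one-entry example; a thesaurus powerful enough for real use would be far larger and would probably weight the related terms.

```python
import re

# Hypothetical thesaurus: each term maps to a set of related terms.
THESAURUS = {
    "running": {"marathon", "jogging"},
}

def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus every related term in the thesaurus."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded

def matches(document, terms):
    """True if the document contains at least one of the terms."""
    words = set(re.findall(r"[a-z0-9]+", document.lower()))
    return bool(words & set(terms))

doc = "I attended the 100th anniversary Boston Marathon"
print(matches(doc, {"running"}))                # plain query
print(matches(doc, expand_query({"running"})))  # expanded query
```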
11
What’s Wrong with SQL (Performance) (1)
select * from content where body like '%running%'
The RDBMS must examine every row in the content table to answer this query, i.e., perform a sequential scan (O[N] time, where N is the number of rows in the table).
12
What’s Wrong with SQL (Performance) (2)
Suppose that a standard RDBMS index is defined on the body column. The values of body will be used as keys for a B-tree, and we could perform
select * from content where body = 'running'
and maybe, depending on the implementation,
select * from content where body like 'running%'
in O[log N] time.
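Why an ordered index helps with equality and prefix matches, but not with '%running%', can be seen with a sorted list and binary search. This is a sketch under an assumption: the sorted Python list merely stands in for the ordered keys of a B-tree, and the key values are hypothetical.

```python
import bisect

# A sorted list stands in for the B-tree's ordered keys (illustration only).
keys = sorted(["run", "runner", "running", "running shoes", "shoes", "walking"])

def contains(keys, value):
    """Equality lookup in O(log N) comparisons via binary search."""
    i = bisect.bisect_left(keys, value)
    return keys[i:i + 1] == [value]

def prefix_range(keys, prefix):
    """All keys starting with prefix: two binary searches bound the range."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[lo:hi]

print(contains(keys, "running"))
print(prefix_range(keys, "running"))
```

A pattern with a leading wildcard has no fixed prefix, so the key ordering gives no starting point and every key must be examined.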
13
Abandoning the RDBMS
We can solve both the performance and search quality problems by dumping all of our data into a full-text search system.
These systems index every word in a document, not just the first words as with the standard RDBMS B-tree.
A full-text index can answer the question "Find me the documents containing the word 'running'" in time that approaches O[1], regardless of the number of documents indexed.
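A minimal sketch of such a full-text (inverted) index, assuming a hash table from each word to the set of documents containing it. The `documents` collection and the `build_index` and `lookup` names are hypothetical.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map every word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, body in documents.items():
        for word in re.findall(r"[a-z]+", body.lower()):
            index[word].add(doc_id)
    return index

def lookup(index, word):
    """One hash-table probe, independent of the number of documents."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical toy collection.
documents = {
    1: "Running shoes on sale",
    2: "My cousin runs every day",
    3: "Trail running tips",
}
index = build_index(documents)
print(lookup(index, "running"))
```

Finding the posting set is a single dictionary probe; the remaining cost is proportional to the number of matching documents, not to the size of the collection.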
14
Indexing
15
Constant Time
If there are 10 million documents in the corpus, a search through those 10 million documents will not take much longer than a search through a corpus of 1,000 documents.
Getting close to constant time in this situation would require that
the 10-million-document collection did not use a larger vocabulary than the 1,000-document collection
and that it was not the case that, say, 90 percent of the documents contained the word "running."
16
Stopwords
A search for a word such as "the", which appears in nearly every document, takes O(N) time.
Stopwords: words that are too common to be worth indexing.
For standard English, the stopword list includes such words as "a," "and," "as," "at," "for," "or," "the," and so forth.
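Filtering stopwords before indexing might look like the sketch below. The `STOPWORDS` set contains only the words listed above; a real list would be longer, and the `index_terms` name is hypothetical.

```python
import re

# Only the stopwords named above; real lists are longer.
STOPWORDS = {"a", "and", "as", "at", "for", "or", "the"}

def index_terms(text):
    """Return the words of text that are worth indexing, in order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

print(index_terms("The best shoes for running and walking"))
```

The same filter should be applied to queries, so that a stopword in the query is ignored rather than reported as having no matches.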
17
Cost
Inserting a new document into the collection
will be slow. We’ll have to go through the document, word by word, and update as many rows in the index as there are distinct words in the document.
18
word-frequency histogram
Suppose that there are 1,000 documents in the collection containing "running shoes". Which are the most relevant to the user's query of "running shoes"?
We need a new data structure: the word-frequency histogram.
It records which words occur in a document and how frequently they occur.
19
Example
The first sentence of Tolstoy's Anna Karenina:
"All happy families resemble one another, but each unhappy family is unhappy in its own way."
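The crude histogram for this sentence can be computed with Python's `collections.Counter`. This is a sketch; the `histogram` helper and its simple tokenization are assumptions for illustration.

```python
import re
from collections import Counter

sentence = ("All happy families resemble one another, but each unhappy "
            "family is unhappy in its own way.")

def histogram(text):
    """Count how often each lowercase word occurs in text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

hist = histogram(sentence)
for word, count in hist.most_common(3):
    print(word, count)
```

Only "unhappy" occurs twice; every other word occurs once, so the raw counts alone say little until they are adjusted.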
20
More Refinements
After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of "resemble" is more interesting than "happy" because "resemble" occurs less frequently in standard English.
Stopwords such as "is" are thrown away altogether.
Stemming is another useful refinement. In the index and in queries, "families" and "family" are reduced to the same stem.
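One way to sketch the prevalence adjustment is shown below. Note the substitution: instead of a table of word frequencies in standard English, this example uses inverse document frequency computed from a small hypothetical corpus, so that words appearing in fewer documents get larger weights. The corpus and function names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def weighted_histogram(doc, corpus):
    """Scale each word count by log(N / number of documents with the word)."""
    n = len(corpus)
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    counts = Counter(tokenize(doc))
    return {w: c * math.log(n / doc_freq[w])
            for w, c in counts.items() if doc_freq[w]}

# Hypothetical corpus in which "happy" is common and "resemble" is rare.
corpus = [
    "All happy families resemble one another",
    "A happy dog",
    "Happy birthday to you",
    "Rainy days",
]
weights = weighted_histogram(corpus[0], corpus)
print(weights["resemble"] > weights["happy"])
```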
21
inter-document similarity
Given a body of histograms, it is possible to answer queries such as "Show me documents that are similar to this one" or "Show me documents whose histogram is closest to a user-entered string."
The inter-document similarity query can be handled by comparing histograms already stored in the text database.
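Comparing two histograms needs a distance measure. The sketch below uses cosine similarity, which is one common choice and an assumption here; the source does not say which measure a given text database uses. The sample strings are hypothetical.

```python
import math
import re
from collections import Counter

def histogram(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(h1, h2):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

a = histogram("running shoes for trail running")
b = histogram("trail running shoes")
c = histogram("chocolate cake recipe")
print(cosine(a, b) > cosine(a, c))
```

A user-entered string is handled the same way: build its histogram, then rank stored documents by similarity to it.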
22
Working with the Public Search Engines
First, Google has to know about your server. This happens either when someone already in the Google index links to your site or when you manually add your URL from a form off the google.com home page.
Second, Google has to be able to read the text on your server. At least as of 2005, none of the public search engines implemented optical character recognition (OCR). This means that text embedded in a GIF, Flash animation, or a Java applet won't be indexed.
Third, Google has to be able to get into all the pages on your server. Pages requiring registration won't be indexed by Google unless your software is smart enough to recognize that it is Google behind the request and make an exception.
23
Prevent Search Engine from Archiving
Some search engines archive what they index.
To prevent search engines from archiving the page, place
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
in the HEAD of your HTML documents.
Robots are not guaranteed to follow such directives.
24
Add Extra Keywords
In the online table of contents page for this book, we have the following META tags in the HEAD:
<meta name="keywords" content="web development online communities MIT 6.171 textbook">
<meta name="description" content="This is the textbook for the MIT course Software Engineering for Internet Applications">
The "keywords" tag adds some words that are relevant to the document, but not present in the visible text.
25
Tags
These tags have been routinely abused. A publisher might add popular search terms to a site that is unrelated to those terms, in hopes of capturing more readers.
A company might add the names of its competitors as keywords.
Users wouldn't see these dirty tricks unless they went to the trouble of using the View Source command in their browser.
Because of this history of abuse, many public search engines ignore these tags.
26
robots.txt
Standard for Robot Exclusion, a protocol for communication between Web publishers and Web crawlers
http://www.robotstxt.org/wc/norobots.html
The publisher places a file at /robots.txt, with instructions for robots. Example:
User-agent: *
# let's keep the robots away from our half-baked stuff
Disallow: /staging
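How a well-behaved crawler might interpret this file can be sketched with Python's standard `urllib.robotparser` module. The host `www.example.com` and the agent name `ExampleBot` are hypothetical; the rules are the ones from the example above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
# let's keep the robots away from our half-baked stuff
Disallow: /staging
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("http://www.example.com/index.html"))
print(allowed("http://www.example.com/staging/draft.html"))
```

As with the META tag, this only works for robots that choose to check the file; nothing enforces it.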
27
Mobile Users
Mobile Internet devices put an even greater stress on information retrieval.
Connection speeds are slower.
Screens are smaller.
It isn't practical for a user to drill down into 20 documents returned by a search engine as possibly relevant to a query, especially if the user is driving a car and using a voice browser.
28
Hint
Generally, users prefer to browse rather than search. If users are resorting to searches in order to get standard answers or perform common tasks, there may be something wrong with a site's navigation or information architecture.
29
A split-system approach to providing full-text search.
30
Extra Features
Who Links To You?
link:siteURL
Example: link:www.google.com
Web Page Translation
Google offers the following translation pairs:
English to and from Arabic, Chinese, French, German, Italian, Korean, Japanese, Spanish, and Portuguese; and German to and from French.
31
Extra Features
Spell Checker
Site Search
Restrict your search to a specific site, e.g. admission site:www.sharif.edu
Similar Pages
Results Prefetching
Image, Music, and Movie Search
32
Into the GYM
Google
Still the most popular search engine
Yahoo!
Gaining on Google
MSN (now Live Search)
Love them or hate them, it’s Microsoft
33
Important Alternatives
Ask (Jeeves retired)
Now owned by Barry Diller (IAC)
Exalead
Developed in France, now has US office
A9
Part of Amazon
Gigablast
Added blog search
34
Whatever Happened To...
AltaVista
AllTheWeb
WiseNut
Teoma
Northern Light
35
Size Wars
August 2005: We index 20 billion documents.
September 2005: We index 8 billion documents, but our index is 3 times larger than our competition's.
So, who's right?
36
Share Of Searches: July 2006
5.6 billion searches in this month
37
How has the share of searches changed?
38
Google's founders: Larry Page and Sergey Brin
"Googol" is the mathematical term for a 1 followed by 100 zeros.
Google's play on the term reflects the company's mission to organize the immense amount of information available on the web.
39
References
Chapter 12: Software Engineering for Internet Applications by Eve Andersson, Philip Greenspun, and Andrew Grumet; revised February 2005
netratings for searchenginewatch.com
40