INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 19/25: Web Search Basics and Classification
Paul Ginsparg
Cornell University, Ithaca, NY
9 Nov 2010
MapReduce (J. Dean and S. Ghemawat, OSDI 2004): http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf
Exact duplicates: easy to eliminate, e.g., by using a hash/fingerprint of the document.
Near-duplicates: abundant on the web, and difficult to eliminate.
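A minimal Python sketch of the hash/fingerprint approach to exact-duplicate elimination (the hash choice and toy documents are illustrative, not from the slides):

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash the full document text: exact duplicates collide by construction."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def eliminate_exact_duplicates(docs):
    """Keep only the first document seen for each fingerprint."""
    seen, unique = set(), []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique

docs = ["the quick brown fox", "the quick brown fox", "the quick brown wolf"]
# The exact duplicate is removed, but the near-duplicate (one word changed)
# survives -- which is why near-duplicates need shingling/MinHash instead.
print(eliminate_exact_duplicates(docs))
```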
Buy something: "MacBook Air"
Download something: "Acrobat Reader"
Chat with someone: "live soccer chat"
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, "Graph structure in the web", Computer Networks 33:309–320, 2000.
Strongly connected component (SCC) in the center.
Lots of pages that get linked to, but don't link (OUT).
Lots of pages that link to other pages, but don't get linked to (IN).
Tendrils, tubes, islands.
The number of in-links (in-degree) averages 8–15 and is not randomly distributed (Poissonian); instead it follows a power law: the number of pages with in-degree i is proportional to 1/i^α, with α ≈ 2.1.
If each of N pages links to a given page independently with probability p, the in-degree m is binomially distributed. Since
\[ \frac{N!}{(N-m)!} = N(N-1)\cdots(N-m+1) \approx N^m , \]
the number of ways to choose m of N pages is
\[ \binom{N}{m} = \frac{N!}{m!\,(N-m)!} \approx \frac{N^m}{m!} , \]
and in the limit N → ∞, with the mean µ = Np held fixed, the binomial becomes
\[ p(m) = e^{-\mu}\,\frac{\mu^m}{m!} , \]
which is known as a Poisson distribution. It is correctly normalized:
\[ \sum_{m=0}^{\infty} p(m) = e^{-\mu} \sum_{m=0}^{\infty} \frac{\mu^m}{m!} = e^{-\mu}\cdot e^{\mu} = 1 . \]
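A quick numerical check of this formula in Python (the value µ = 10 is an arbitrary choice for illustration):

```python
import math

MU = 10.0  # mean in-degree; arbitrary value for illustration

def poisson(m: int, mu: float = MU) -> float:
    """p(m) = e^{-mu} * mu^m / m!, computed in log space to avoid overflow."""
    return math.exp(-mu + m * math.log(mu) - math.lgamma(m + 1))

# Normalization: the probabilities sum to 1, as derived above.
print(sum(poisson(m) for m in range(200)))  # ~1.0

# The Poisson tail falls off far faster than a power law ~ 1/i^2.1,
# so it cannot explain the heavy-tailed in-degrees observed on the web.
for i in (10, 100, 1000):
    print(i, poisson(i), i ** -2.1)
```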
[Figure: the Poisson distribution p(m) = e^{-µ} µ^m / m!, plotted for m up to 30]
[Figure: the Poisson distribution p(m) = e^{-µ} µ^m / m!, plotted for m up to 100]
[Figure: the in-degree distribution on log-log axes (m up to 10^4); a Poisson falls off far faster than the observed power law]
Locations on the web: the server (nytimes.com → New York), the web page (a nytimes.com article about Albania), the user (located in Palo Alto).
Determining the user's location: IP address, information provided by the user (e.g., in a user profile), mobile phone.
Mapping place names in text to coordinates, e.g., East Palo Alto CA → Latitude 37.47 N, Longitude 122.14 W, is an important NLP problem.
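A toy sketch of gazetteer-based geotagging (the two-entry gazetteer is hypothetical; a real system must also disambiguate ambiguous place names, which is what makes this an NLP problem):

```python
# Hypothetical mini-gazetteer mapping place names to (latitude, longitude).
# West longitudes are negative, so 122.14 W becomes -122.14.
GAZETTEER = {
    "east palo alto ca": (37.47, -122.14),
    "ithaca ny": (42.44, -76.50),
}

def geotag(text: str) -> dict:
    """Return coordinates for each known place name mentioned in the text."""
    lowered = text.lower()
    return {place: coord for place, coord in GAZETTEER.items() if place in lowered}

print(geotag("Flights from Ithaca NY to East Palo Alto CA"))
```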
Newly registered domains (domain flooding).
A set of pages that all point to each other, to boost each other's PageRank.
Pay somebody to put your link on their highly ranked page (the "schuetze horoskop" example).
Leave comments that include the link on blogs.
For example, Google bombs like the query "Who is a failure?"
Restructure your content in a way that makes it easy to index.
Talk with influential bloggers and have them link to your site.
Add more interesting and original content.
Quality signals: links, statistically analyzed (PageRank etc.); usage (users visiting a page); no adult content (e.g., no pictures with flesh tones); distribution and structure of text (e.g., no keyword stuffing).
Editorial intervention: blacklists; top queries audited; complaints addressed; suspect patterns detected.
The web keeps growing. But growth is no longer exponential?
Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall, and users may switch to the search engine that has the best coverage of the web.
Any sum of two numbers has its own dynamic page on Google (example: the query "2+4"). Many other dynamic sites generate an infinite number of pages.
Example: your laptop. Is it part of the web?
There used to be (and still are?) billions of pages that are only indexed by anchor text.
Crawl policies differ from engine to engine: max URL depth, max count per host, anti-spam rules, priority rules, etc.
What gets indexed also differs: anchor text, frames, meta-keywords, the size of the indexed prefix of a page, etc.
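A toy sketch of how a crawler might apply such URL-filtering rules (the thresholds and helper names here are hypothetical):

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 8          # hypothetical max URL (path) depth
MAX_PER_HOST = 1000    # hypothetical max pages crawled per host

pages_per_host = Counter()

def admit(url: str) -> bool:
    """Apply max-depth and max-count-per-host rules to a candidate URL."""
    parsed = urlparse(url)
    depth = len([seg for seg in parsed.path.split("/") if seg])
    if depth > MAX_DEPTH:
        return False
    if pages_per_host[parsed.netloc] >= MAX_PER_HOST:
        return False
    pages_per_host[parsed.netloc] += 1
    return True

print(admit("http://example.com/a/b/c.html"))    # True
print(admit("http://example.com/" + "x/" * 20))  # False: too deep
```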
\[ \frac{180/1000}{180/1000 + 20/1000} = \frac{180}{200} = 0.9 \]
1. Intro: vector space classification
2. Very simple vector space classification: Rocchio
3. kNN
The training set is given as part of the input in text classification. It is interactively created in relevance feedback.
The centroid is the average of all documents in the class.
\[ \vec{\mu}(c_j) = \frac{1}{|D_j|} \sum_{d \in D_j} \vec{v}(d) , \]
where D_j is the set of training documents in class c_j and \vec{v}(d) is the vector space representation of d.
We can interpret the centroid as the prototype of the class.
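A minimal nearest-centroid (Rocchio) classifier in Python/NumPy, following the centroid formula above (the toy vectors and class names are made up for illustration):

```python
import numpy as np

def train_rocchio(X: np.ndarray, y: np.ndarray) -> dict:
    """One centroid per class: mu(c_j) = (1/|D_j|) * sum of vectors in D_j."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(centroids: dict, d: np.ndarray):
    """Assign d to the class of the nearest centroid (its 'prototype')."""
    return min(centroids, key=lambda c: np.linalg.norm(d - centroids[c]))

# Toy 2-D "document vectors" for two classes.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array(["class_a", "class_a", "class_b", "class_b"])
centroids = train_rocchio(X, y)
print(classify(centroids, np.array([0.8, 0.3])))  # class_a
```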
[Figure: training documents from classes "a" and "b" in the plane]
We expect a test document d to have the same label as the training documents located in the local region surrounding d.
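A minimal kNN classifier along these lines (Python/NumPy; k = 3 and the toy data are illustrative choices):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, d, k=3):
    """Majority vote among the k training documents nearest to d."""
    dists = np.linalg.norm(X_train - d, axis=1)   # distance to every training doc
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [0.1, 1.0], [0.2, 0.8]])
y_train = np.array(["class_a", "class_a", "class_a", "class_b", "class_b"])
print(knn_classify(X_train, y_train, np.array([0.7, 0.4])))  # class_a
```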
But the linear preprocessing of documents is as expensive as training a Naive Bayes classifier, and you always have to preprocess the training set anyway; so in reality the training time of kNN is linear in the size of the training set.