

  1. INFO 4300 / CS4300 Information Retrieval
     Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
     IR 19/25: Web Search Basics and Classification
     Paul Ginsparg, Cornell University, Ithaca, NY
     9 Nov 2010

  2. Discussion 5, Tue 16 Nov
     For this class, read and be prepared to discuss the following:
     Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Usenix OSDI '04, 2004.
     http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf

  3. Overview
     1 Recap
     2 Spam
     3 Size of the web
     4 Intro vector space classification
     5 Rocchio
     6 kNN

  4. Outline
     1 Recap
     2 Spam
     3 Size of the web
     4 Intro vector space classification
     5 Rocchio
     6 kNN

  5. Duplicate detection
     The web is full of duplicated content, more so than many other collections.
     Exact duplicates: easy to eliminate, e.g., by hash/fingerprint.
     Near-duplicates: abundant on the web, difficult to eliminate.
     For the user, it is annoying to get a search result with near-identical documents (recall marginal relevance).
     We need to eliminate near-duplicates.
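
A minimal sketch of exact-duplicate elimination by fingerprinting; the choice of SHA-1 and the in-memory set are illustrative assumptions, not part of the slides:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash the full document text; identical documents collide exactly."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep only the first document seen for each fingerprint."""
    seen = set()
    unique = []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique

print(deduplicate(["a rose is a rose", "a rose is a rose", "a daisy"]))
# -> ['a rose is a rose', 'a daisy']
```

Exact hashing cannot catch near-duplicates, which is why the shingling/minhash machinery of the next slide is needed.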

  6. Shingling: Summary
     Input: N documents.
     Choose an n-gram size for shingling, e.g., n = 5.
     Pick 200 random permutations, represented as hash functions.
     Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
     Compute the N(N − 1)/2 pairwise similarities.
     Take the transitive closure of documents with similarity > θ.
     Index only one document from each resulting equivalence class.
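
A compact sketch of this procedure. The 200 "random permutations" are simulated by salted hash functions (a common implementation shortcut); the example documents and the salted-MD5 choice are assumptions, while the 200 permutations and shingle size 5 come from the slide:

```python
import hashlib
from itertools import combinations

NUM_PERMS = 200   # number of "random permutations", as on the slide
SHINGLE_N = 5     # shingle (word n-gram) size, as on the slide

def shingles(text, n=SHINGLE_N):
    """Set of word n-grams of the document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def h(salt, shingle):
    """Salted hash standing in for the salt-th random permutation."""
    return int(hashlib.md5(f"{salt}:{shingle}".encode()).hexdigest(), 16)

def sketch(text):
    """One column of the 200 x N sketch matrix: the min hash per permutation."""
    sh = shingles(text)
    return [min(h(s, x) for x in sh) for s in range(NUM_PERMS)]

docs = {
    "d1": "the quick brown fox jumps over the lazy dog today",
    "d2": "the quick brown fox jumps over the lazy cat today",
    "d3": "an entirely different page about web search engine basics",
}
sk = {name: sketch(text) for name, text in docs.items()}
for a, b in combinations(docs, 2):   # the N(N-1)/2 pairwise similarities
    est = sum(x == y for x, y in zip(sk[a], sk[b])) / NUM_PERMS
    print(a, b, round(est, 2))
```

The fraction of permutations on which two sketches agree is an unbiased estimate of the Jaccard similarity of the two shingle sets.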

  7. Web IR: Differences from traditional IR
     Links: the web is a hyperlinked document collection.
     Queries: web queries are different, more varied, and there are a lot of them. How many? ≈ 10^9
     Users: users are different, more varied, and there are a lot of them. How many? ≈ 10^9
     Documents: documents are different, more varied, and there are a lot of them. How many? ≈ 10^11
     Context: context is more important on the web than in many other IR applications.
     Ads and spam.

  8. Types of queries / user needs in web search
     Informational user needs: I need information on something, e.g., "low hemoglobin". We called this "information need" earlier in the class. On the web, information needs proper are only a subclass of user needs.
     Other user needs: navigational and transactional.
     Navigational user needs: I want to go to this web site, e.g., "hotmail", "myspace", "United Airlines".
     Transactional user needs: I want to make a transaction. Buy something: "MacBook Air". Download something: "Acrobat Reader". Chat with someone: "live soccer chat".
     Difficult problem: how can the search engine tell what the user's need or intent for a particular query is? (A toy heuristic is sketched below.)
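
A toy heuristic classifier for the three intent classes, just to make the distinction concrete. Real engines use query logs, click data, and learned models; every rule and word list here is an illustrative assumption:

```python
NAV_SITES = {"hotmail", "myspace", "united airlines"}
TRANS_CUES = {"buy", "download", "price", "cheap", "chat"}

def guess_intent(query: str) -> str:
    q = query.lower().strip()
    if q in NAV_SITES or q.endswith(".com"):
        return "navigational"   # user wants a specific site
    if any(cue in q.split() for cue in TRANS_CUES):
        return "transactional"  # user wants to do something
    return "informational"      # default: user wants information

for q in ["low hemoglobin", "hotmail", "download Acrobat Reader"]:
    print(q, "->", guess_intent(q))
```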

  9. Bowtie structure of the web
     A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309–320, 2000.
     Strongly connected component (SCC) in the center.
     Lots of pages that get linked to, but don't link out (OUT).
     Lots of pages that link to other pages, but don't get linked to (IN).
     Tendrils, tubes, islands.
     The number of in-links (in-degree) averages 8–15 and is not randomly distributed (Poissonian); instead it follows a power law: the number of pages with in-degree i is ∝ 1/i^α, α ≈ 2.1.
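
A small illustration of the bowtie decomposition: compute the central SCC of a toy link graph, then the IN and OUT sides relative to it. The five-node graph and the use of networkx are assumptions for illustration:

```python
import networkx as nx

# Toy link graph: 1->2->3->1 form an SCC; 0 only links in (IN); 4 only gets linked to (OUT).
g = nx.DiGraph([(0, 1), (1, 2), (2, 3), (3, 1), (2, 4)])

core = max(nx.strongly_connected_components(g), key=len)  # central SCC of the bowtie
anchor = next(iter(core))
in_side = {n for n in g if n not in core and nx.has_path(g, n, anchor)}
out_side = {n for n in g if n not in core and nx.has_path(g, anchor, n)}
print("SCC:", core, "IN:", in_side, "OUT:", out_side)
# -> SCC: {1, 2, 3} IN: {0} OUT: {4}
```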

  10. Poisson Distribution
     Bernoulli process with N trials, each with probability p of success. The probability of m successes is
         p(m) = (N choose m) p^m (1 − p)^(N−m).
     In the limit where N is very large and p is small, this is parametrized by just μ = Np (μ = mean number of successes).
     For N ≫ m, we have N!/(N − m)! = N(N − 1)···(N − m + 1) ≈ N^m, so (N choose m) = N!/(m!(N − m)!) ≈ N^m/m!, and
         p(m) ≈ lim_{N→∞} (N^m/m!) (μ/N)^m (1 − μ/N)^{N−m} = e^{−μ} μ^m/m!
     (ignoring (1 − μ/N)^{−m}, since by assumption N ≫ μm). The N dependence drops out for N → ∞, with the average μ held fixed (p → 0).
     The form p(m) = e^{−μ} μ^m/m! is known as a Poisson distribution. It is properly normalized: Σ_{m≥0} p(m) = e^{−μ} Σ_{m≥0} μ^m/m! = e^{−μ} · e^μ = 1.
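
A quick numerical check of this limit; the particular values of N, p, and m are arbitrary assumptions:

```python
from math import comb, exp, factorial

N, p = 100_000, 0.0001   # large N, small p
mu = N * p               # mu = 10

for m in (5, 10, 15):
    binom = comb(N, m) * p**m * (1 - p)**(N - m)
    poisson = exp(-mu) * mu**m / factorial(m)
    print(m, round(binom, 6), round(poisson, 6))
# The two columns agree to roughly four decimal places, as the derivation predicts.
```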

  11. Poisson Distribution for μ = 10
     [Plot of p(m) = e^{−10} 10^m/m! for m = 0, ..., 30: the distribution peaks near m = 10 at p(m) ≈ 0.125 and is negligible beyond m ≈ 25.]
     Compare to the power law p(m) ∝ 1/m^2.1.

  12. Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{−10} 10^m/m!
     [Plot comparing the two on a linear scale for m up to 100: the Poisson is concentrated around m = 10, while the power law starts high and decays slowly, with a heavy tail.]

  13. Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{−10} 10^m/m!
     [Same comparison on a log–log scale, out to m = 10000: the power law appears as a straight line of slope −2.1, while the Poisson falls off much faster in the tail.]
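
A sketch that reproduces this comparison with matplotlib and scipy; the normalization of the power law over the plotted range is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

m = np.arange(1, 10001)
power = m ** -2.1
power /= power.sum()               # normalize over the plotted range
pois = poisson.pmf(m, mu=10)

plt.loglog(m, power, label="power law ∝ 1/m^2.1")
plt.loglog(m, pois, label="Poisson, μ = 10")
plt.xlabel("in-degree m")
plt.ylabel("p(m)")
plt.legend()
plt.show()
```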

  14. The spatial context: Geo-search
     Three relevant locations:
     the server (nytimes.com → New York);
     the web page (nytimes.com article about Albania);
     the user (located in Palo Alto).
     Locating the user: IP address, information provided by the user (e.g., in a user profile), mobile phone.
     Geo-tagging: parse text and identify the coordinates of the geographic entities. Example: East Palo Alto CA → Latitude: 37.47 N, Longitude: 122.14 W. An important NLP problem.
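
A toy geo-tagger that matches place names against a small gazetteer. Real geo-tagging must resolve ambiguity ("Paris, TX" vs. "Paris, France"); the gazetteer entries other than East Palo Alto (which is from the slide) are approximate, illustrative assumptions:

```python
GAZETTEER = {
    "east palo alto": (37.47, -122.14),   # from the slide: 37.47 N, 122.14 W
    "ithaca": (42.44, -76.50),
    "albania": (41.15, 20.17),
}

def geotag(text: str):
    """Return (place, (lat, lon)) for each gazetteer entry found in the text."""
    t = text.lower()
    return [(place, coords) for place, coords in GAZETTEER.items() if place in t]

print(geotag("Flooding reported in East Palo Alto CA on Tuesday"))
# -> [('east palo alto', (37.47, -122.14))]
```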

  15. Outline
     1 Recap
     2 Spam
     3 Size of the web
     4 Intro vector space classification
     5 Rocchio
     6 kNN

  16. The goal of spamming on the web
     You have a page that will generate lots of revenue for you if people visit it.
     Therefore, you would like to direct visitors to this page.
     One way of doing this: get your page ranked highly in search results.
     How can I get my page ranked highly?

  17. Spam technique: Keyword stuffing / Hidden text
     Misleading meta-tags, excessive repetition.
     Hidden text via font colors, style-sheet tricks, etc.
     This used to be very effective; most search engines now catch these tricks. (A crude detector is sketched below.)
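
One crude keyword-stuffing heuristic: flag a page when a single term accounts for an implausibly large fraction of its tokens. The threshold and the whitespace tokenization are assumptions; production detectors are far more sophisticated:

```python
from collections import Counter

def looks_stuffed(text: str, threshold: float = 0.20) -> bool:
    """Flag pages where the most frequent term exceeds `threshold` of all tokens."""
    tokens = text.lower().split()
    if not tokens:
        return False
    _, top_count = Counter(tokens).most_common(1)[0]
    return top_count / len(tokens) > threshold

print(looks_stuffed("cheap watches cheap watches cheap watches buy now"))  # True
print(looks_stuffed("a short normal sentence about web search"))           # False
```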

  18. Keyword stuffing
     [Screenshot of a keyword-stuffed page; image not reproduced here.]

  19. Spam technique: Doorway and lander pages
     Doorway page: optimized for a single keyword; redirects to the real target page.
     Lander page: optimized for a single keyword or a misspelled domain name; designed to attract surfers who will then click on ads.

  20. Lander page
     Number one hit on Google for the search "composita".
     The only purpose of this page: get people to click on the ads and make money for the page owner.

  21. Spam technique: Duplication
     Get good content from somewhere (steal it or produce it yourself).
     Publish a large number of slight variations of it.
     For example, publish the answer to a tax question with the spelling variations of "tax deferred" on the previous slide.

  22. Spam technique: Cloaking
     Serve fake content to the search engine spider.
     So do we just penalize this always? No: there are legitimate uses (e.g., different content for US vs. European users). (A detection sketch follows.)
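
One way to probe for cloaking is to fetch the same URL with a crawler-like and a browser-like User-Agent and compare the responses. A minimal sketch using the requests library; the user-agent strings, the token-overlap test, and the threshold are all assumptions:

```python
import requests

SPIDER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def fetch(url: str, user_agent: str) -> str:
    return requests.get(url, headers={"User-Agent": user_agent}, timeout=10).text

def maybe_cloaked(url: str, threshold: float = 0.5) -> bool:
    """Low token overlap between the two versions suggests cloaking."""
    a = set(fetch(url, SPIDER_UA).split())
    b = set(fetch(url, BROWSER_UA).split())
    jaccard = len(a & b) / max(len(a | b), 1)
    return jaccard < threshold

# Usage (against a site you are allowed to probe):
# print(maybe_cloaked("http://example.com/"))
```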

  23. Spam technique: Link spam
     Create lots of links pointing to the page you want to promote.
     Put these links on pages with high (or at least non-zero) PageRank:
     newly registered domains (domain flooding);
     a set of pages that all point to each other to boost each other's PageRank (mutual admiration society);
     pay somebody to put your link on their highly ranked page ("schuetze horoskop" example);
     leave comments that include the link on blogs.
     (The PageRank sketch below shows how a mutual admiration society inflates scores.)
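
A minimal power-iteration PageRank on a toy graph, showing how a mutual admiration society (pages s1–s3, which link only to each other and to the promoted page) inflates the promoted page's score. The graph, the damping factor 0.85, and the iteration count are standard but assumed values:

```python
def pagerank(links, d=0.85, iters=50):
    """Plain power iteration; `links` maps each page to its out-links."""
    pages = list(links)
    pr = {p: 1 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * pr[p] / len(outs)
        pr = new
    return pr

# 'spam' is promoted by a ring of pages that all point at each other and at it.
links = {
    "a": ["b"], "b": ["a"],
    "s1": ["s2", "spam"], "s2": ["s3", "spam"], "s3": ["s1", "spam"],
    "spam": ["s1"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Running this puts "spam" at or near the top of the ranking, even though no page outside the ring links to it.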

  24. SEO: Search engine optimization
     Promoting a page in the search rankings is not necessarily spam.
     It can also be a legitimate business, which is called SEO.
     You can hire an SEO firm to get your page highly ranked.
     There are many legitimate reasons for doing this, for example, Google bombs like "Who is a failure?"
     And there are many legitimate ways of achieving this:
     restructure your content in a way that makes it easy to index;
     talk with influential bloggers and have them link to your site;
     add more interesting and original content.

  25. The war against spam
     Quality indicators:
     links, statistically analyzed (PageRank etc.);
     usage (users visiting a page);
     no adult content (e.g., no pictures with flesh tones);
     distribution and structure of text (e.g., no keyword stuffing).
     Combine all of these indicators and use machine learning. (A sketch follows.)
     Editorial intervention: blacklists, top queries audited, complaints addressed, suspect patterns detected.
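
A sketch of combining such quality indicators with a learned classifier, here scikit-learn logistic regression. The feature vectors and labels are made-up stand-ins for (PageRank, usage, adult-content score, keyword-stuffing score):

```python
from sklearn.linear_model import LogisticRegression

# Each row: [pagerank, usage, adult_content_score, keyword_stuffing_score]
X = [
    [0.8, 0.9, 0.0, 0.1],   # reputable, well-visited pages
    [0.7, 0.8, 0.1, 0.0],
    [0.1, 0.1, 0.9, 0.8],   # low-quality pages with spam signals
    [0.0, 0.2, 0.7, 0.9],
]
y = [0, 0, 1, 1]            # 0 = ham, 1 = spam

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.2, 0.3, 0.6, 0.7]])[0][1])  # P(spam) for a new page
```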

  26. Webmaster guidelines
     Major search engines have guidelines for webmasters.
     These guidelines tell you what is legitimate SEO and what is spamming.
     Ignore these guidelines at your own risk: once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely).
     There is often a fine line between spam and legitimate SEO.
     The scientific study of fighting spam on the web is called adversarial information retrieval.

  27. Outline
     1 Recap
     2 Spam
     3 Size of the web
     4 Intro vector space classification
     5 Rocchio
     6 kNN

  28. Growth of the web
     The web keeps growing. But is growth no longer exponential?

  29. Size of the web: Who cares?
     Media.
     Users: they may switch to the search engine that has the best coverage of the web; users (sometimes) care about recall, and if we underestimate the size of the web, search engine results may have low recall.
     Search engine designers (how many pages do I need to be able to handle?).
     Crawler designers (which policy will crawl close to N pages?).

  30. What is the size of the web? Any guesses?
