Crawling HTML (10/12/2010)

Class Overview
• Topics: Crawling, HTML, Query processing, Content Analysis, Indexing, and other cool stuff (document layer, network layer)
• Today: Crawlers, Server Architecture
(Graphic by Stephen Combs, HowStuffWorks.com, and Kari Meoller, Turner Broadcasting)

Standard Web Search Engine Architecture
• Crawlers crawl the web and fetch pages
• Store documents, check for duplicates, extract links; assign DocIds
• Create an inverted index, served from index servers
• A user submits a query; the search engine looks it up in the inverted index and shows results to the user (a minimal inverted-index sketch follows below)
(Slide adapted from Marti Hearst, UC Berkeley)
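The architecture above centers on an inverted index that maps each word to the documents containing it. As a minimal illustration only (the documents, tokenizer, and query below are made up for the example, not taken from the lecture), an inverted index and a query lookup might look like this:

# Minimal inverted-index sketch: word -> set of DocIds that contain it.
# The sample documents and whitespace tokenizer are illustrative only.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping DocId -> document text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "web crawlers fetch pages",
    2: "the indexer builds an inverted index",
    3: "crawlers follow links to new pages",
}
index = build_inverted_index(docs)

# Query processing: intersect the posting sets for each query word.
def search(index, query):
    postings = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search(index, "crawlers pages"))   # -> {1, 3}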

Danger Will Robinson!!
• Consequences of a bug
• Max 6 hits/server/minute, plus the lab crawling policy:
  http://www.cs.washington.edu/lab/policies/crawlers.html

Open-Source Crawlers
• GNU Wget
  – Utility for downloading files from the Web.
  – Fine if you just need to fetch files from 2-3 sites.
• Heritrix
  – Open-source, extensible, Web-scale crawler
  – Easy to get running.
  – Web-based UI
• Nutch
  – Featureful, industrial-strength Web search package.
  – Includes the Lucene information-retrieval part
    • TF/IDF and other document ranking
    • Optimized, inverted-index data store
  – You get complete control through easy programming.

Thinking about Efficiency
• Clock cycle: 2 GHz
  – Typically completes 2 instructions / cycle
  – ~10 cycles / instruction, but pipelining & parallel execution
  – Thus: 4 billion instructions / sec
• Disk access: 1-10 ms
  – Depends on seek distance; published average is 5 ms
  – Thus perform 200 seeks / sec
  – (And we are ignoring rotation and transfer times)
• Disk is 20 million times slower!!! (4 billion instructions/sec ÷ 200 seeks/sec ≈ 20 million instructions in the time of one seek)
• Store the index in an Oracle database?
• Store the index using files and the Unix filesystem?

Search Engine Architecture
• Crawler (Spider)
  – Searches the web to find pages. Follows hyperlinks. Never stops
• Indexer
  – Produces data structures for fast searching of all words in the pages
• Retriever
  – Query interface
  – Database lookup to find hits
    • 300 million documents
    • 300 GB RAM, terabytes of disk
  – Ranking, summaries
• Front End

Spiders (Crawlers, Bots)
• 1000s of spiders
• Various purposes:
  – Search engines
  – Digital rights management
  – Advertising
  – Spam
  – Link checking – site validation

Spiders = Crawlers (a minimal sketch of this loop follows below)
• Queue := initial page URL0
• Do forever
  – Dequeue URL
  – Fetch P
  – Parse P for more URLs; add them to queue
  – Pass P to (specialized?) indexing program
• Issues…
  – Which page to look at next?
    • keywords, recency, focus, ???
  – Avoid overloading a site
  – How deep within a site to go?
  – How frequently to visit pages?
  – Traps!
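The spider loop above (dequeue a URL, fetch the page, parse it for more URLs, pass the page to an indexer) can be sketched in a few lines. This is a simplified illustration under assumptions not in the slides: it uses Python's standard library, crawls breadth-first, and ignores politeness, robots.txt, traps, and duplicate-content checks, all of which the later slides address.

# Bare-bones spider loop sketch (breadth-first).
# Assumptions for illustration: urllib for fetching, a trivial regex
# link parser, no politeness, no robots.txt handling, no trap detection.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def index_page(url, text):
    # Stand-in for the (specialized?) indexing program.
    print(f"indexed {url}: {len(text)} characters")

def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])          # Queue := initial page URL0
    seen = {seed_url}
    while queue and max_pages > 0:     # "Do forever" (bounded here)
        url = queue.popleft()          # Dequeue URL
        try:
            page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                   # Fetch P (skip on failure)
        index_page(url, page)          # Pass P to indexing program
        for link in re.findall(r'href="([^"]+)"', page):  # Parse P for more URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute) # add them to the queue
        max_pages -= 1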

Crawling Issues
• Storage efficiency
• Search strategy
  – Where to start
  – Link ordering
  – Circularities
  – Duplicates
  – Checking for changes
• Politeness
  – Forbidden zones: robots.txt
  – CGI & scripts
  – Load on remote servers
  – Bandwidth (download only what you need)
• Parsing pages for links
• Scalability
• Malicious servers: SEOs

Robot Exclusion
• A person may not want certain pages indexed.
• Crawlers should obey the Robot Exclusion Protocol.
  – But some don't
• Look for the file robots.txt at the highest directory level
  – If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler by adding the line:
  <META NAME="ROBOTS" CONTENT="NOINDEX">

Danger, Danger
• Ensure that your crawler obeys robots.txt
• Don't make any of the typical mistakes; in particular:
  – Provide contact info in the user-agent field
  – Monitor that email address
  – Notify the CS Lab Staff
  – Honor all Do Not Scan requests
  – Post any "stop-scanning" requests
  – "The scanee is always right."
  – Max 6 hits/server/minute

Robots Exclusion Protocol
• Format of robots.txt
  – Two fields: User-agent to specify a robot, Disallow to tell the agent what to ignore
• To exclude all robots from a server:
  User-agent: *
  Disallow: /
• To exclude one robot from two directories:
  User-agent: WebCrawler
  Disallow: /news/
  Disallow: /tmp/
• View the robots.txt specification at
  http://info.webcrawler.com/mak/projects/robots/norobots.html
  (a robots.txt check sketch appears after the link-extraction notes below)

Outgoing Links? Which tags / attributes hold URLs?
• Parse HTML… looking for what?
  – Anchor tag: <a href="URL" …> … </a>
  – Option tag: <option value="URL" …> … </option>
  – Map: <area href="URL" …>
  – Frame: <frame src="URL" …>
  – Link to an image: <img src="URL" …>
  – Relative path vs. absolute path: <base href= …>
• Bonus problem: Javascript
• In our favor: Search Engine Optimization
(figure: raw page text being scanned for markup such as "A href = www.cs"; a link-extraction sketch follows below)
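The tag list above (a href, option value, area href, frame src, img src, plus base href for resolving relative against absolute paths) can be pulled out with the standard-library HTML parser. A minimal sketch, not the lecture's code; the example page URL and markup are invented, and JavaScript-generated links are ignored:

# Sketch: extract outgoing URLs from the tags/attributes listed above,
# resolving relative paths against <base href=...> (or the page URL).
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_ATTRS = {
    "a": "href", "area": "href", "frame": "src",
    "img": "src", "option": "value",
}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]              # relative vs. absolute paths
        attr = URL_ATTRS.get(tag)
        if attr and attrs.get(attr):
            self.links.append(urljoin(self.base, attrs[attr]))

parser = LinkExtractor("http://example.com/dir/page.html")
parser.feed('<a href="../other.html">x</a> <img src="pic.gif"> <frame src="/nav.html">')
print(parser.links)
# ['http://example.com/other.html', 'http://example.com/dir/pic.gif',
#  'http://example.com/nav.html']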

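Before fetching a URL, a polite crawler consults the robots.txt rules described above. Python's standard library includes a parser for the format; the sketch below is illustrative, and the user-agent name "ClassCrawler" is made up for the example.

# Sketch: obey the Robots Exclusion Protocol using the standard library.
from urllib import robotparser
from urllib.parse import urlparse

def allowed(url, agent="ClassCrawler"):
    # robots.txt lives at the highest directory level of the host.
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser(root + "/robots.txt")
    rp.read()                          # fetch and parse the file
    return rp.can_fetch(agent, url)

# e.g. skip the URL unless allowed(url) returns True.
# Note: the <META NAME="ROBOTS" CONTENT="NOINDEX"> tag must be checked
# separately, after the page itself has been downloaded.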
Web Crawling Strategy
• Starting location(s)
• Traversal order
  – Depth first (LIFO)
  – Breadth first (FIFO)
  – Or ???
• Politeness
• Cycles?
• Coverage?

Structure of Mercator Spider (uses document fingerprints)
1. Remove URL from queue
2. Simulate network protocols & REP
3. Read with RewindInputStream (RIS)
4. Has document been seen before? (checksums and fingerprints)
5. Extract links
6. Download new URL?
7. Has URL been seen before?
8. Add URL to frontier

Fetching Pages
• Need to support http, ftp, gopher, ....
  – Extensible!
• Need to fetch multiple pages at once.
• Need to cache as much as possible
  – DNS
  – robots.txt
  – Documents themselves (for later processing)
• Need to be defensive!
  – Need to time out http connections.
  – Watch for "crawler traps" (e.g., infinite URL names)
  – See section 5 of the Mercator paper.
  – Use a URL filter module
  – Checkpointing!

URL Frontier (priority queue)
• Most crawlers do breadth-first search from seeds.
• Politeness constraint: don't hammer servers!
  – Obvious implementation: "live host table"
  – Will it fit in memory?
  – Is this efficient?
• Mercator's politeness (sketched after this page's notes):
  – One FIFO subqueue per thread.
  – Choose subqueue by hashing the host's name.
  – Dequeue the first URL whose host has NO outstanding requests.

Duplicate Detection (a fingerprint sketch follows below)
• URL-seen test: has the URL been seen before?
  – To save space, store a hash
• Content-seen test: different URL, same doc.
  – Suppress link extraction from mirrored pages.
• What to save for each doc?
  – 64-bit "document fingerprint"
  – Minimize number of disk reads upon retrieval.

Nutch: A simple architecture
• Seed set
• Crawl
• Remove duplicates
• Extract URLs (minus those we've been to)
  – new frontier
• Crawl again
• Can do this with a Map/Reduce architecture
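The URL-seen and content-seen tests above can be backed by two hash-based sets. A minimal sketch, assuming a truncated MD5 as the 64-bit fingerprint; Mercator's own checksum and fingerprint functions differ.

# Sketch of the two duplicate-detection tests. The hash choice
# (MD5 truncated to 64 bits) is a stand-in, not Mercator's function.
import hashlib

seen_urls = set()          # URL-seen test: store a hash, not the full URL
seen_fingerprints = set()  # content-seen test: 64-bit "document fingerprint"

def fingerprint64(data: bytes) -> int:
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def url_seen(url: str) -> bool:
    h = fingerprint64(url.encode("utf-8"))
    if h in seen_urls:
        return True
    seen_urls.add(h)
    return False

def content_seen(page: bytes) -> bool:
    """Different URL, same doc? If so, suppress link extraction."""
    fp = fingerprint64(page)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False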

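Mercator's politeness scheme from the URL-frontier slide above (one FIFO subqueue per crawling thread, with the subqueue chosen by hashing the host name so a host is never hit by two requests at once) can be sketched roughly as follows. The thread count and the simplification that each thread drains only its own subqueue are assumptions for illustration.

# Sketch of a Mercator-style polite URL frontier: all URLs for a given
# host land in the same FIFO subqueue, and each subqueue is served by
# exactly one thread, so each host sees at most one outstanding request.
from collections import deque
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self, num_threads=8):            # thread count is illustrative
        self.subqueues = [deque() for _ in range(num_threads)]

    def add(self, url):
        host = urlparse(url).netloc
        # Built-in hash varies between runs but is stable within one crawl.
        i = hash(host) % len(self.subqueues)       # choose subqueue by hashing host
        self.subqueues[i].append(url)

    def next(self, thread_id):
        """Each thread dequeues only from its own subqueue."""
        q = self.subqueues[thread_id]
        return q.popleft() if q else None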
Mercator Statistics
• Exponentially increasing size
• Page types downloaded:
  PAGE TYPE     PERCENT
  text/html     69.2%
  image/gif     17.9%
  image/jpeg     8.1%
  text/plain     1.5%
  pdf            0.9%
  audio          0.4%
  zip            0.4%
  postscript     0.3%
  other          1.4%

Advanced Crawling Issues
• Limited resources
  – Fetch the most important pages first
• Topic-specific search engines
  – Only care about pages which are relevant to the topic: "focused crawling"
• Minimize stale pages
  – Efficient re-fetch to keep the index timely
  – How to track the rate of change for pages?

Outline
• Search Engine Overview
• HTTP
• Crawlers
• Server Architecture

Focused Crawling
• Priority queue instead of FIFO.
• How to determine priority? (a priority-scoring sketch appears at the end of this section)
  – Similarity of page to driving query
    • Use traditional IR measures
    • Exploration / exploitation problem
  – Backlink
    • How many links point to this page?
  – PageRank (Google)
    • Some links to this page count more than others
    • Forward link of a page
  – Location heuristics
    • E.g., is the site in .edu?
    • E.g., does the URL contain 'home' in it?
  – Linear combination of the above

Connecting on the WWW / Server Architecture
(diagram: a Web browser on the client OS connects across the Internet to a Web server on the server OS)
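The focused-crawling scheme above replaces the FIFO frontier with a priority queue, ranking candidate URLs by a linear combination of signals (similarity to the driving query, backlink count, location heuristics such as a .edu host or 'home' in the URL). A hedged sketch; the weights, the example URLs, and the query-similarity and backlink values are placeholders, not tuned or measured numbers.

# Sketch of a focused-crawl frontier: a priority queue ordered by a
# linear combination of the signals named above.
import heapq
from urllib.parse import urlparse

def priority(url, query_sim, backlinks):
    score = 0.6 * query_sim                       # similarity to driving query
    score += 0.3 * min(backlinks / 100.0, 1.0)    # backlink count (capped)
    if urlparse(url).netloc.endswith(".edu"):     # location heuristic: .edu site
        score += 0.05
    if "home" in url.lower():                     # location heuristic: 'home' in URL
        score += 0.05
    return score

frontier = []   # max-priority via negated score (heapq is a min-heap)

def push(url, query_sim, backlinks):
    heapq.heappush(frontier, (-priority(url, query_sim, backlinks), url))

def pop():
    return heapq.heappop(frontier)[1] if frontier else None

push("http://www.cs.washington.edu/education/", query_sim=0.8, backlinks=250)
push("http://example.com/ads/page1", query_sim=0.1, backlinks=3)
print(pop())   # the .edu page with high query similarity comes out first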
