CS6200: Information Retrieval
Slides by: Jesse Anderton
Pitfalls of Crawling
Crawling, session 7
Crawling at Scale
A commercial crawler should support thousands of HTTP requests per second. If the crawler is distributed, that applies for each node. Achieving this requires careful engineering of each component.
Politeness is harder to manage at scale: many sites have URLs with many subdomains at a single IP address.
Frontiers containing enormous numbers of URLs are not uncommon. Managing the filtering and prioritization of URLs is a challenge.
Malicious sites will try to trap your crawler and waste your crawling resources. For instance, some sites respond to crawlers by intentionally adding seconds of latency to each HTTP response. Other sites respond with data crafted to confuse, crash, or mislead a crawler.
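To give a rough sense of what sustaining many concurrent requests involves, here is a minimal asynchronous fetching sketch. It is not any particular crawler's code: the aiohttp library, the concurrency cap, and the timeout are all illustrative assumptions.

```python
import asyncio
import aiohttp  # third-party HTTP client, chosen only for illustration

MAX_IN_FLIGHT = 200  # illustrative cap on concurrent requests

async def fetch(session, url, sem):
    async with sem:  # limit the number of requests in flight at once
        try:
            async with session.get(url) as resp:
                body = await resp.read()
                return url, resp.status, body
        except Exception as exc:  # slow or hostile servers: fail fast and move on
            return url, None, exc

async def crawl_batch(urls):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    timeout = aiohttp.ClientTimeout(total=10)  # don't let one server stall a worker
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, u, sem) for u in urls))

# results = asyncio.run(crawl_batch(list_of_urls))
```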
Lee et al’s DRUM algorithm gives a sense of the requirements of large scale de-duplication. It manages a collection of tuples of keys (hashed URLs), values (arbitrary data, such as quality scores), and aux data (URLs). It supports the following operations:
check: The key's presence in the repository is checked; if it is present, we can retrieve its value.
update: The key/value pair is merged into the repository.
check+update: Both operations are performed for a batch of keys in a single pass.
Data flow for DRUM: A tiered system of buffers in RAM and on disk is used to support large-scale operations.
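To make the interface concrete, here is a heavily simplified sketch of batched check+update. It keeps the repository in an in-memory dict rather than on disk, so it only illustrates the batching of (key, value, aux) tuples; the class, its methods, and the batch size are assumptions, not DRUM's actual implementation.

```python
from hashlib import sha1

class MiniDrum:
    """Toy stand-in for DRUM: batches check+update requests, then applies them in one pass."""

    def __init__(self, batch_size=4):
        self.repo = {}        # key -> value; stand-in for the on-disk repository
        self.buffer = []      # pending (key, value, aux) tuples
        self.batch_size = batch_size

    @staticmethod
    def key_of(url):
        return sha1(url.encode()).hexdigest()   # hashed URL is the key

    def check_update(self, url, value):
        """Queue a check+update for url; flush when the buffer fills."""
        self.buffer.append((self.key_of(url), value, url))
        if len(self.buffer) >= self.batch_size:
            return self.flush()
        return []

    def flush(self):
        """One pass over the buffered batch: report which keys were seen, store new ones."""
        results = []
        for key, value, aux in self.buffer:
            seen = key in self.repo            # "check"
            if not seen:
                self.repo[key] = value         # "update"
            results.append((aux, seen))
        self.buffer.clear()
        return results

# drum = MiniDrum()
# for url in ["http://a.com/", "http://b.com/", "http://a.com/", "http://c.com/"]:
#     for aux, seen in drum.check_update(url, value=1.0):
#         print(aux, "already seen" if seen else "new")
```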
DRUM is used as storage for the IRLBot crawler, which processes each new URL in the following steps.
1. The URL is checked against DRUM to see whether the URL has already been fetched.
2. The URL is checked against the domain's budget to filter out spam links (discussed next).
3. The URL is checked against its robots.txt. If necessary, we fetch robots.txt from the server.
4. The URL is placed into a queue to be crawled by the next available thread.
IRLBot Architecture
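A hypothetical sketch of that per-URL pipeline is below. The helper objects (seen_drum, budgets, robots_cache, ready_queue) and their methods are invented for illustration; they are not IRLBot's actual interfaces.

```python
from urllib.parse import urlparse

def process_url(url, seen_drum, budgets, robots_cache, ready_queue):
    # 1. Skip URLs we have already fetched (URL-seen test via DRUM).
    if seen_drum.has_seen(url):
        return

    domain = urlparse(url).netloc

    # 2. Budget check: set aside URLs from domains that have used up their budget.
    if not budgets.has_budget(domain):
        budgets.defer(url)
        return

    # 3. Robots check: use cached rules, fetching robots.txt if we don't have them.
    rules = robots_cache.get_or_fetch(domain)
    if not rules.can_fetch("*", url):
        return

    # 4. Hand the URL to the next available crawling thread.
    ready_queue.put(url)
```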
The web is full of link farms and other forms of link spam, generally posted by people trying to manipulate page quality measures such as PageRank. These links waste a crawler’s resources, and detecting and avoiding them is important for correct page quality calculations. One way to mitigate this, implemented in IRLBot, is based on the observation that spam servers tend to have very large numbers of pages linking to each other.
They assign a budget to each domain based on the number of in-links from other domains. Crawling is deprioritized for domains that have exceeded their budget, so link-filled spam domains are largely ignored.
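The sketch below illustrates the general idea of in-link-based budgets; the constants and the exact policy are illustrative, not IRLBot's actual budget-enforcement algorithm.

```python
from collections import defaultdict
from urllib.parse import urlparse

class DomainBudgets:
    """Give each domain a crawl budget proportional to its in-links from other domains."""

    def __init__(self, base_budget=10, per_inlink=2):   # illustrative constants
        self.in_links = defaultdict(set)   # domain -> set of domains linking to it
        self.crawled = defaultdict(int)    # domain -> pages crawled so far
        self.base_budget = base_budget
        self.per_inlink = per_inlink

    def record_link(self, src_url, dst_url):
        src = urlparse(src_url).netloc
        dst = urlparse(dst_url).netloc
        if src != dst:                     # only links from *other* domains count
            self.in_links[dst].add(src)

    def budget(self, domain):
        return self.base_budget + self.per_inlink * len(self.in_links[domain])

    def allow_crawl(self, url):
        domain = urlparse(url).netloc
        if self.crawled[domain] >= self.budget(domain):
            return False                   # over budget: skip or crawl at low priority
        self.crawled[domain] += 1
        return True
```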
A spider trap is a collection of web pages which, intentionally or not, provide an infinite space of URLs to crawl. Some site administrators place spider traps on their sites in order to trap or crash spambots, or defend against malicious bandwidth-consuming scripts. A common example of a benign spider trap is a calendar which links continually to the next year.
A benign spider trap on http://www.timeanddate.com
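Crawlers typically bound the damage an infinite URL space can do with simple URL-level heuristics. The sketch below assumes per-domain URL caps plus limits on URL length and path depth; the function and thresholds are illustrative, not a standard algorithm.

```python
from collections import defaultdict
from urllib.parse import urlparse

MAX_URLS_PER_DOMAIN = 10_000   # cap how far we follow any one domain
MAX_PATH_SEGMENTS = 12         # very deep paths are a common trap signal
MAX_URL_LENGTH = 512

domain_counts = defaultdict(int)

def accept_url(url):
    """Return True if the URL passes the trap heuristics and may be queued."""
    parsed = urlparse(url)
    if len(url) > MAX_URL_LENGTH:
        return False
    if len([seg for seg in parsed.path.split("/") if seg]) > MAX_PATH_SEGMENTS:
        return False
    if domain_counts[parsed.netloc] >= MAX_URLS_PER_DOMAIN:
        return False
    domain_counts[parsed.netloc] += 1
    return True
```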
The first defense against spider traps is to have a good politeness policy, and always follow it.
By limiting how quickly you request pages from the same domain, you reduce the possible damage a trap can do.
Sites with benign traps also often provide instructions for avoiding them in robots.txt, as in the excerpt below.
[...]
User-agent: *
Disallow: /createshort.html
Disallow: /scripts/savecustom.php
Disallow: /scripts/wquery.php
Disallow: /scripts/tzq.php
Disallow: /scripts/savepersonal.php
Disallow: /information/mk/
Disallow: /information/feedback-save.php
Disallow: /information/feedback.html?
Disallow: /gfx/stock/
Disallow: /bm/
Disallow: /eclipse/in/*?iso
Disallow: /custom/save.php
Disallow: /calendar//index.html
Disallow: /calendar//monthly.html
Disallow: /calendar//custom.html
Disallow: /counters//newyeara.html
Disallow: /counters//worldfirst.html
[...]
From http://www.timeanddate.com/robots.txt
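Rules like these can be checked with Python's standard-library robots.txt parser. The snippet below just demonstrates testing URLs against a few of the lines above; it is not part of the crawler described in these slides.

```python
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /createshort.html
Disallow: /scripts/savecustom.php
Disallow: /gfx/stock/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Disallowed path -> False; unlisted path -> True
print(rp.can_fetch("*", "http://www.timeanddate.com/createshort.html"))
print(rp.can_fetch("*", "http://www.timeanddate.com/calendar/"))
```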
A breadth-first search implementation of crawling is not sufficient for coverage, freshness, spam avoidance, or other needs of a real crawler. Scaling the crawler up takes careful engineering, and often detailed systems knowledge of the hardware architecture you’re developing for. Next, we’ll look at how to efficiently store the content we’ve crawled.