CS6200: Information Retrieval
Slides by: Jesse Anderton
HTTP Crawling
Crawling, session 2
A Basic Crawler

A crawler maintains a frontier – a collection of pages to be crawled – and iteratively selects and crawls pages from it.
The frontier is initialized with a list of seed pages.
The next page to crawl must be chosen carefully, for politeness and performance reasons.
New URLs are processed and filtered before being added to the frontier.
We will cover these details in subsequent sessions.
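As a concrete illustration, here is a minimal sketch of this loop in Python. The `fetch_and_extract` helper is hypothetical, standing in for the download-and-parse steps covered below.

```python
from collections import deque

def crawl(seed_urls, fetch_and_extract, max_pages=1000):
    """Minimal crawler loop over a FIFO frontier.

    `fetch_and_extract` is a hypothetical helper: it downloads a URL
    and returns the outgoing links found in the page.
    """
    frontier = deque(seed_urls)   # initialized with the seed pages
    seen = set(seed_urls)         # avoid adding the same URL twice
    crawled = 0

    while frontier and crawled < max_pages:
        # A real crawler selects the next page carefully (politeness,
        # priority); plain FIFO order is used here only for brevity.
        url = frontier.popleft()
        for link in fetch_and_extract(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        crawled += 1
```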
Requesting and downloading a URL involves several steps.
1. A DNS server is asked to translate the domain into an IP address.
2. An HTTP HEAD request is made at the IP address to determine the page type, and whether the page contents have changed since the last crawl.
3. An HTTP GET request is made to retrieve the new page contents.
These steps are illustrated in the diagram and code sketch below.
Crawler → DNS Server: Request IP for www.wikipedia.org
DNS Server → Crawler: IP is: 208.80.154.224
Crawler → Web Server: Connect to 208.80.154.224:80
Crawler → Web Server: HTTP HEAD /
Web Server → Crawler: <HTTP headers for />
Crawler → Web Server: HTTP GET /
Web Server → Crawler: <HTML content for />
Request process for http://www.wikipedia.org/
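The same exchange can be sketched with Python's standard library. This is a minimal illustration, not production code: many sites (including wikipedia.org today) redirect plain HTTP to HTTPS, and a robust crawler would handle redirects, timeouts, and connection reuse explicitly.

```python
import socket
import http.client

host = "www.wikipedia.org"

# Step 1: ask a DNS server to translate the domain into an IP address.
ip = socket.gethostbyname(host)

# Step 2: send a HEAD request to inspect the headers (content type,
# last-modified) without downloading the body.
conn = http.client.HTTPConnection(ip, 80)
conn.request("HEAD", "/", headers={"Host": host})
head = conn.getresponse()
print(head.status, head.getheader("Content-Type"), head.getheader("Last-Modified"))
head.read()  # drain the (empty) body so the connection can be reused

# Step 3: if the page is new or has changed, retrieve it with GET.
conn.request("GET", "/", headers={"Host": host})
html = conn.getresponse().read()
conn.close()
```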
HTTP Request and Response

The HTTP request and response take the following forms, respectively:

Request:  <method> <url> <HTTP version> [<optional headers>]
Response: <HTTP version> <code> <status> [<headers>]
For example:

GET / HTTP/1.1

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sun, 20 Jul 2014 01:37:07 GMT
Content-Type: text/html
Content-Length: 896
Accept-Ranges: bytes
Date: Thu, 08 Jan 2015 00:36:25 GMT
Age: 12215
Connection: keep-alive

<!DOCTYPE html>
<html lang=en>
<meta charset="utf-8">
<title>Unconfigured domain</title>
<link rel="shortcut icon" href="//wikimediafoundation.org/favicon.ico">
...
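To make the wire format above concrete, here is a sketch that writes the request line and headers over a raw socket and reads back the status line. The host and header values are just illustrative choices.

```python
import socket

# The request line follows <method> <url> <HTTP version>; headers come
# next, and a blank line ends the request.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.wikipedia.org\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("www.wikipedia.org", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

# The response begins with <HTTP version> <code> <status>.
print(response.split(b"\r\n", 1)[0].decode())  # e.g. HTTP/1.1 200 OK
```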
Downloaded files must be parsed according to their content type (usually available in the Content-Type header), and URLs must be extracted for adding to the frontier. HTML documents in the wild often have formatting errors which the parser must tolerate. Other document formats have their own issues; URLs may be embedded in PDFs, Word documents, etc. Many URLs are missed, especially ones behind dynamic URL schemes or on pages generated by JavaScript and AJAX calls. Such content is part of the so-called “deep web.”
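For HTML, the standard library's tolerant parser is enough to sketch link extraction. This example resolves relative links against an assumed base URL; note that it finds nothing generated by JavaScript, which is exactly the missed content described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from href and src attributes.

    html.parser is tolerant of the malformed HTML common in the wild:
    it does not raise on unclosed or mismatched tags.
    """
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative URLs against the page's base URL.
                self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("http://example.com/a/")
extractor.feed('<a href="../b">link</a> <img src=logo.png>')
print(extractor.links)
# ['http://example.com/b', 'http://example.com/a/logo.png']
```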
Many possible URLs can refer to the same web page. It is important (for crawler performance, and for the size of your index!) to use a canonical, or normalized, version of each URL to avoid repeated requests. Many rules have been used; some are guaranteed to only rewrite URLs to refer to the same resource, and others can make mistakes. It can also be worthwhile to create specific normalization rules for important web domains, e.g. by encoding which URL parameters result in different web content.
Conversion to Canonical URL:
http://example.com/some/../folder?id=1#anchor
→ http://example.com/some/../folder
→ http://www.example.com/folder
Here are a few possible URL canonicalization rules.
Rule | Safe? | Example
Remove default port | Always | http://example.com:80 → http://example.com
Decode octets for unreserved characters | Always | http://example.com/%7Ehome → http://example.com/~home
Remove . and .. | Usually | http://example.com/a/./b/../c → http://example.com/a/c
Force trailing slash for directories | Usually | http://example.com/a/b → http://example.com/a/b/
Remove default index pages | Sometimes | http://example.com/index.html → http://example.com
Remove the fragment | Sometimes | http://example.com/a#b/c → http://example.com/a
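Here is a sketch of how two of these rules (one "Always" and one "Usually" safe) might be implemented with Python's standard library; the riskier "Sometimes" rules are deliberately left out.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    parts = urlsplit(url)

    # Always safe: drop the scheme's default port. (parts.hostname is
    # also lowercased, another rewrite that never changes the resource.)
    netloc = parts.netloc
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        netloc = parts.hostname

    # Usually safe: resolve . and .. path segments.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"  # normpath strips trailing slashes; restore them

    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))

print(canonicalize("http://example.com:80/a/./b/../c"))
# http://example.com/a/c
```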
Web crawling requires attending to many details. DNS responses should be cached, HTTP HEAD requests should generally be sent before GET requests, and so on. Extracting and normalizing URLs is important because it dramatically affects your coverage and the time wasted crawling, indexing, and ultimately retrieving duplicate content. Next, we'll see how to detect duplicate pages hosted at different URLs.
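As a final illustration of one such detail, the DNS caching mentioned above can be as simple as memoizing the resolver. This is a minimal sketch; a production crawler would also respect each record's TTL, which lru_cache ignores.

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=100_000)
def resolve(host):
    # Repeated URLs on the same host now skip the DNS round trip.
    return socket.gethostbyname(host)
```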