Crawling

  • T. Yang, UCSB 290N

Some slides are from Croft/Metzler/Strohman's textbook

Table of Contents

  • Basic crawling architecture and flow
  • Distributed crawling
  • Scheduling: Where to crawl
  • Crawling control with robots.txt
  • Freshness
  • Focused crawling
  • URL discovery
  • Deep web, Sitemaps, & Data feeds
  • Data representation and storage

Web Crawler

  • Finds and downloads web pages automatically for search and web mining

  • Web is huge and constantly growing

Downloading Web Pages

  • Every page has a unique uniform resource locator (URL)
  • Web pages are stored on web servers that use HTTP to exchange information with client software
  • HTTP/1.1

Downloading Web Pages with HTTP

  • Need a scalable domain name system (DNS) server (hostname to IP address translation)
  • Crawler attempts to connect to the server host using a specific port
  • After connection, crawler sends an HTTP request to the web server to request a page
  • usually a GET request (sketched below)
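A minimal sketch of this DNS-plus-GET flow using Python's standard library; the hostname and URL are placeholders:

    import socket
    from urllib.request import Request, urlopen

    # Placeholder page used only for illustration.
    url = "http://example.com/index.html"

    # DNS: translate the hostname to an IP address (usually done by the
    # OS resolver; shown explicitly here).
    ip = socket.gethostbyname("example.com")

    # Connect and send an HTTP GET request; a real crawler sets its own
    # User-Agent so site operators can identify it.
    req = Request(url, headers={"User-Agent": "ExampleCrawler/0.1"}, method="GET")
    with urlopen(req, timeout=10) as resp:
        html = resp.read()            # raw page bytes
        print(resp.status, len(html))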

A Crawler Architecture

Web Crawler

  • Starts with a set of seeds
  • Seeds are added to a URL request queue
  • Crawler starts fetching pages from the request queue
  • Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
  • New URLs are added to the crawler's request queue, or frontier
  • Scheduler prioritizes between discovering new URLs and refreshing existing ones
  • Repeat the above process (sketched below)
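A minimal sketch of that fetch-parse-enqueue loop, single-threaded and with no politeness or scheduling beyond a visited set; the link parser below stands in for a full HTML parser:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects href attributes from <a> link tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        frontier = deque(seeds)          # the URL request queue ("frontier")
        seen = set(seeds)
        while frontier and max_pages > 0:
            url = frontier.popleft()
            max_pages -= 1
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                 # fetch failed; a real crawler retries
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)    # resolve relative links
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)    # new URL joins the frontier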

Distributed Crawling: Parallel Execution

  • Crawlers may be running in diverse geographies – USA, Europe, Asia, etc.
  • Periodically update a master index
  • Incremental update so this is "cheap"
  • Three reasons to use multiple computers:
  • Helps to put the crawler closer to the sites it crawls
  • Reduces the number of sites the crawler has to remember

  • More computing resources

Variations of Distributed Crawlers

  • Crawlers are independent
  • Fetch pages oblivious to each other
  • Static assignment
  • Distributed crawler uses a hash function to assign URLs to crawling computers
  • hash function can be computed on the host part of each URL (sketched below)
  • Dynamic assignment
  • Master-slaves
  • Central coordinator splits URLs among crawlers
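A sketch of static assignment, assuming a hypothetical cluster of four crawling machines; hashing the host part keeps all pages of one site on the same machine:

    import hashlib
    from urllib.parse import urlparse

    NUM_CRAWLERS = 4   # assumed cluster size, for illustration

    def assign_crawler(url: str) -> int:
        """Hash the host part of the URL to pick a crawling computer."""
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

    # URLs on the same host always map to the same crawler:
    print(assign_crawler("http://www.cs.ucsb.edu/robots.txt"))
    print(assign_crawler("http://www.cs.ucsb.edu/index.html"))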

A Distributed Crawler Architecture

Options for URL Outgoing-Link Assignment

  • Firewall mode: each crawler only fetches URLs within its partition – typically a domain
  • inter-partition links are not followed
  • Crossover mode: each crawler may follow inter-partition links into another partition
  • possibility of duplicate fetching
  • Exchange mode: crawlers periodically exchange the URLs they discover in another partition


Multithreaded page downloader

  • Web crawlers spend a lot of time waiting for responses to requests
  • Multi-threaded for concurrency
  • Tolerate slowness of some sites
  • Few hundreds of threads/machine (sketched below)
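A minimal sketch of such a downloader with a thread pool (eight workers keep the example small; real crawlers run far more, as noted above):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        """Download one page; threads spend most of their time blocked on I/O."""
        try:
            with urlopen(url, timeout=10) as resp:
                return url, resp.read()
        except OSError:
            return url, None   # slow or failing site: give up, don't block others

    urls = ["http://example.com/", "http://example.org/"]   # placeholder URLs
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, "failed" if body is None else f"{len(body)} bytes")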

Table of Contents

  • Crawling architecture and flow
  • Scheduling: Where to crawl
  • Crawling control with robots.txt
  • Freshness
  • Focused crawling
  • URL discovery
  • Deep web, Sitemaps, & Data feeds
  • Data representation and storage

Where do we spider next?

Web URLs are crawled and parsed, and the extracted URLs wait in the queue. How fast can spam URLs contaminate the queue? Assume BFS crawling from a start page, a normal average outdegree of 10, and a spammer able to generate dynamic pages with 1000 outlinks each:

  • BFS depth = 2: 100 URLs on the queue, including one spam page
  • BFS depth = 3: 2,000 URLs on the queue; 50% belong to the spammer
  • BFS depth = 4: 1.01 million URLs on the queue; 99% belong to the spammer (see the check below)
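These figures follow from the branching factors (the slide rounds 1,990 up to 2,000); a quick check, assuming BFS expands every queued URL at each depth:

    NORMAL_OUT = 10    # average outdegree of normal pages
    SPAM_OUT = 1000    # outlinks per dynamically generated spam page

    normal, spam = 99, 1    # depth 2: 100 queued URLs, one of them spam
    for depth in (3, 4):
        normal, spam = normal * NORMAL_OUT, spam * SPAM_OUT
        total = normal + spam
        print(f"depth {depth}: {total:,} URLs, {spam / total:.0%} spam")
    # depth 3: 1,990 URLs, 50% spam
    # depth 4: 1,009,900 URLs, 99% spam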

Scheduling Issues: Where do we spider next?

  • Keep all spiders busy (load balanced)
  • Avoid fetching duplicates repeatedly
  • Respect politeness and robots.txt
  • Crawlers could potentially flood sites with requests for pages
  • use politeness policies: e.g., a delay between requests to the same web server (sketched below)
  • Handle crawling abnormalities:
  • Avoid getting stuck in traps
  • Tolerate faults with retry
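A minimal sketch of such a politeness policy; the 2-second delay is an illustrative choice, not a standard:

    import time
    from urllib.parse import urlparse

    class PolitenessGate:
        """Enforces a minimum delay between requests to the same host."""
        def __init__(self, delay=2.0):
            self.delay = delay
            self.last_access = {}   # host -> time of last request

        def wait(self, url):
            host = urlparse(url).netloc
            ready_at = self.last_access.get(host, 0.0) + self.delay
            now = time.monotonic()
            if now < ready_at:
                time.sleep(ready_at - now)
            self.last_access[host] = time.monotonic()

    gate = PolitenessGate()
    gate.wait("http://example.com/a")   # first request to the host: no wait
    gate.wait("http://example.com/b")   # sleeps ~2 s before returning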

More URL Scheduling Issues

  • Conflicting goals:
  • Big sites should be crawled completely
  • New URLs should be discovered, and known URLs recrawled, frequently
    – Important URLs need to have high priority
  • What's best? Quality, freshness, topic coverage
    – Avoid/minimize duplicates and spam
  • Recently crawled URLs should be excluded from revisiting, to avoid endlessly revisiting the same URLs
  • Access properties of URLs to make a scheduling decision


/robots.txt

  • Protocol for giving spiders ("robots") limited access to a website, originally from 1994
  • www.robotstxt.org/
  • Website announces its request on what can(not) be crawled
  • For a server, create a file robots.txt
  • This file specifies access restrictions
  • Place it in the top-level directory of the web server
    – E.g. www.cs.ucsb.edu/robots.txt
    – www.ucsb.edu/robots.txt

Robots.txt example

  • No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:

More robots.txt Examples

Freshness

  • Web pages are constantly being added, deleted, and modified
  • Web crawler must continually revisit pages it has already crawled to see if they have changed, in order to maintain the freshness of the document collection
  • stale copies no longer reflect the real contents of the web pages

Freshness

  • HTTP protocol has a special request type called HEAD that makes it easy to check for page changes (sketched below)
  • returns information about the page (e.g., its last-modified date), not the page itself
  • this information is not always reliable
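A minimal sketch of such a check with the standard library (placeholder URL):

    from urllib.request import Request, urlopen

    # A HEAD request fetches only the response headers, not the body.
    req = Request("http://example.com/", method="HEAD")
    with urlopen(req, timeout=10) as resp:
        # Last-Modified, if the server sends it, hints at page changes,
        # but servers may omit it or report it inaccurately.
        print(resp.status, resp.headers.get("Last-Modified"))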

Freshness

  • Not possible to constantly check all pages
  • Need to check important pages and pages that change frequently
  • Freshness is the proportion of pages that are fresh
  • Age as an approximation

Focused Crawling

  • Attempts to download only those pages that are about a particular topic
  • used by vertical search applications
  • Relies on the fact that pages about a topic tend to have links to other pages on the same topic
  • popular pages for a topic are typically used as seeds
  • Crawler uses a text classifier to decide whether a page is on topic (sketched below)
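A sketch of that decision point; `classifier`, `fetch`, and `extract_links` are assumed stand-ins for a trained text classifier, the downloader, and an HTML link parser:

    from collections import deque

    def focused_crawl(seeds, classifier, fetch, extract_links, max_pages=1000):
        """Follow links only from pages the classifier judges on-topic."""
        frontier = deque(seeds)    # popular on-topic pages as seeds
        seen = set(seeds)
        while frontier and max_pages > 0:
            url = frontier.popleft()
            max_pages -= 1
            text = fetch(url)
            if text is None or not classifier.predict(text):
                continue           # off-topic page: do not follow its links
            for link in extract_links(url, text):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)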

Table of Contents

  • Basic crawling architecture and flow
  • Scheduling: Where to crawl
  • Crawling control with robots.txt
  • Freshness
  • Focused crawling
  • Discover new URLs
  • Deep web, Sitemaps, & Data feeds
  • Data representation and storage

Discover New URLs & Deep Web

  • Challenges in discovering new URLs
  • Bandwidth/politeness prevent the crawler from covering large sites fully
  • Deep web
  • Strategies
  • Mining new topics/related URLs from news, blogs, Facebook/Twitter
  • Identify sites that tend to deliver more new URLs
  • Deep-web handling/sitemaps
  • RSS feeds

Deep Web

  • Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web
  • much larger than the conventional Web
  • Three broad categories:
  • private sites
    – no incoming links, or may require log in with a valid account
  • form results
    – sites that can be reached only after entering some data into a form
  • scripted pages
    – pages that use JavaScript, Flash, or another client-side language to generate links

Sitemaps

  • Placed at the root directory of an HTML server
  • For example, http://example.com/sitemap.xml
  • Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency
  • Generated by web server administrators
  • Tells crawler about pages it might not otherwise find
  • Gives crawler a hint about when to check a page for changes

Sitemap Example
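The original slide shows a sitemap file; as a stand-in, here is a small fabricated sitemap parsed with Python's standard library (the entry values are illustrative only):

    import xml.etree.ElementTree as ET

    SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://example.com/</loc>
        <lastmod>2024-01-01</lastmod>
        <changefreq>daily</changefreq>
      </url>
    </urlset>"""

    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for url in ET.fromstring(SITEMAP).findall("sm:url", ns):
        # loc is required; lastmod/changefreq are the crawler's revisit hints.
        print(url.findtext("sm:loc", namespaces=ns),
              url.findtext("sm:lastmod", namespaces=ns),
              url.findtext("sm:changefreq", namespaces=ns))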


Document Feeds

  • Many documents are published
  • created at a fixed time and rarely updated again
  • e.g., news articles, blog posts, press releases, email
  • Published documents from a single source can be ordered in a sequence called a document feed
  • new documents found by examining the end of the feed

Document Feeds

  • Two types:
  • A push feed alerts the subscriber to new documents
  • A pull feed requires the subscriber to check periodically for new documents
  • Most common format for pull feeds is called RSS
  • Really Simple Syndication, RDF Site Summary, Rich Site Summary, or ...
  • Examples:
  • CNN RSS newsfeed under different categories
  • Amazon RSS popular product feeds under different tags

RSS Example

RSS

  • A number of channel elements:
  • title
  • link
  • description
  • ttl tag (time to live)
    – amount of time (in minutes) contents should be cached
  • RSS feeds are accessed like web pages
  • using HTTP GET requests to web servers that host them
  • Easy for crawlers to parse (sketched below)
  • Easy to find new information
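A minimal sketch of reading those channel elements from a fetched feed, again with the standard library (the feed content is fabricated):

    import xml.etree.ElementTree as ET

    FEED = """<rss version="2.0"><channel>
      <title>Example News</title>
      <link>http://example.com/news</link>
      <description>Hypothetical feed</description>
      <ttl>60</ttl>
      <item><title>First story</title><link>http://example.com/1</link></item>
    </channel></rss>"""

    channel = ET.fromstring(FEED).find("channel")
    # ttl: how long (in minutes) to cache before re-fetching the feed.
    print(channel.findtext("title"), "ttl =", channel.findtext("ttl"))
    for item in channel.findall("item"):
        print(item.findtext("title"), item.findtext("link"))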

Table of Contents

  • Crawling architecture and flow
  • Scheduling: Where to crawl
  • Crawling control with robots.txt
  • Freshness
  • Focused crawling
  • URL discovery
  • Deep web, Sitemaps, & Data feeds
  • Data representation and storage

Conversion

  • Text is stored in hundreds of incompatible file formats
  • e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
  • Other types of files are also important
  • e.g., PowerPoint, Excel
  • Typically use a conversion tool
  • converts the document content into a tagged text format such as HTML or XML
  • retains some of the important formatting information

Character Encoding

  • A character encoding is a mapping between bits and glyphs
  • i.e., getting from bits in a file to characters on a screen
  • Can be a major source of incompatibility
  • ASCII is the basic character encoding scheme for English
  • encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes

Character Encoding

  • Other languages can have many more glyphs
  • e.g., Chinese has more than 40,000 characters, with over 3,000 in common use
  • Many languages have multiple encoding schemes
  • e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
  • must specify the encoding
  • can't have multiple languages in one file
  • Unicode was developed to address encoding problems

Unicode

  • Single mapping from numbers to glyphs
  • attempts to include all glyphs in common use in all known languages
  • Unicode is a mapping between numbers and glyphs
  • does not uniquely specify a bits-to-glyph mapping!
  • e.g., UTF-8, UTF-16, UTF-32

Software Internationalization with Unicode

  • Search software needs to be able to run for serving different international content
  • Proliferation of encodings comes from a need for compatibility and to save space
  • UTF-8 uses one byte for English (ASCII), as many as 4 bytes for some traditional Chinese characters
  • variable-length encoding, more difficult to do string operations
  • UTF-32 uses 4 bytes for every character
  • Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space)

Example of Unicode

  • e.g., Greek letter pi (π) is Unicode symbol number 960
  • In binary, 00000011 11000000 (3C0 in hexadecimal)
  • The UTF-8 encoding is 11001111 10000000 (CF80 in hexadecimal), checked below
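This can be verified directly in Python; the output matches the slide's numbers:

    ch = "\u03c0"                        # Greek letter pi
    print(ord(ch))                       # 960 (0x3C0): the Unicode code point
    print(ch.encode("utf-8").hex())      # cf80: two bytes in UTF-8
    print(ch.encode("utf-32-be").hex())  # 000003c0: fixed 4 bytes in UTF-32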


Storing the Documents

  • Many reasons to store converted document text
  • saves crawling time when a page is not updated
  • provides efficient access to text for snippet generation, information extraction, etc.
  • Data stores used for the page repository
  • Store many documents in large files, rather than each document in a file
  • avoids overhead in opening and closing files
  • reduces seek time relative to read time
  • Compound document formats
  • used to store multiple documents in a file
  • e.g., TREC Web

TREC Web Format

Text Compression

  • Text is highly redundant (or predictable)
  • Compression techniques exploit this redundancy to make files smaller without losing any of the content
  • Compression of indexes is a separate topic
  • Popular algorithms can compress HTML and XML text by 80% (illustrated below)
  • e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
  • may compress large files in blocks to make access faster
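A quick illustration of that redundancy with DEFLATE via Python's zlib (the HTML string is fabricated; real pages often compress similarly well):

    import zlib

    # Repetitive fabricated HTML, standing in for a typical web page.
    html = ("<html><body>" + "<p class='row'>item</p>" * 500 +
            "</body></html>").encode("utf-8")

    packed = zlib.compress(html, 9)   # DEFLATE, as used by zip/gzip
    print(f"{len(html)} -> {len(packed)} bytes "
          f"({1 - len(packed) / len(html):.0%} smaller)")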