CRAWLING WITH APACHE NUTCH
Ashish Kumar Sinha, Deeksha Kushal Motwani, Shailender Joseph
CONTENTS
Web-Crawling
Apache Nutch
Crawling Algorithm in Apache Nutch
REST API
Demo
Conclusion
Web-Crawling
Web-crawling is a process by which crawlers/spiders/bots scan a website and collect details about each page: titles, images, keywords, and other linked pages. They also record new sites or pages, changes to existing sites, and dead links.
According to Google, “The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As our crawlers visit these websites, they use links on those sites to discover other pages.”
Web-crawling is used by search engines like Google, Yahoo or Bing to retrieve the contents of a URL, examine that page for other links, retrieve the URLs for those links, and so on.
An early web crawler consisted of two programs, spider and mite: the spider maintained the queue of URLs to visit, while mite was responsible for downloading webpages.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
Web-scraping is a different technique: it converts unstructured data found on the internet into a structured format for analyzing or for later reference. Like crawlers, scrapers have the ability to browse different pages and follow links, but they focus on extracting the data on those pages, not on indexing the web.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
Process Flow of a Sequential Web-Crawler
Popular open-source web-crawlers include Heritrix, which is designed for web-archiving, and Apache Nutch.
https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf
The crawl frontier is the data structure that stores the URLs eligible for crawling and supports such operations as adding new URLs and selecting the next URL to fetch (it can be seen as a priority queue). The URLs initially placed in the frontier are known as seeds: the addresses from which the crawl commences. A good selection of seeds may be necessary to ensure good coverage.
https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf
A sequential web-crawler repeatedly:
1. consults the frontier to decide what pages to visit,
2. fetches those pages and informs the frontier with the response of each page,
3. updates the frontier with any new hyperlinks contained in those pages it has visited, and
4. visits those new webpages based on the policies of the crawler frontier.
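As a rough illustration of this loop, here is a minimal sequential crawler sketch in Java; the frontier is a plain FIFO queue rather than a real priority queue, links are extracted with a naive regex, politeness (robots.txt, per-host delays) is omitted, and the seed URL is a placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sequential crawler: a FIFO frontier, a fetcher, and a naive regex link extractor.
public class SequentialCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/")); // seeds
        Set<String> visited = new HashSet<>();
        HttpClient client = HttpClient.newHttpClient();

        while (!frontier.isEmpty() && visited.size() < 10) { // cap at 10 pages for the demo
            String url = frontier.poll();                    // 1. consult the frontier
            if (!visited.add(url)) continue;                 // skip already-visited pages
            try {
                HttpResponse<String> resp = client.send(     // 2. fetch the page
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
                Matcher m = LINK.matcher(resp.body());       // 3. extract new hyperlinks
                while (m.find()) frontier.add(m.group(1));   // 4. feed them back to the frontier
                System.out.println("Fetched " + url + " (" + resp.statusCode() + ")");
            } catch (Exception e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }
}
```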
Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
The fetcher is a multi-threaded application (i.e., it processes more than one task in parallel) that employs protocol plugins to retrieve the content of a set of URLs. Each protocol plugin implements one of the network protocols supported by the system (HTTP, FTP, etc.).
A plugin is a software component that adds a specific feature to an existing computer program. A network protocol is a set of rules, procedures and formats that defines communication between two or more devices over a network. Network protocols govern the end-to-end process of timely, secure and managed data communication.
Fetching a URL retrieves the page's raw content over the network (just as your browser does when you view the page).
Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
Parsing involves one or more of the following:
○ Building the HTML tag tree, which represents the page in the form of a tree structure.
○ Removing stopwords from the page's content and stemming the remaining words.
Stemming is the process of reducing inflected (and sometimes derived) words to their word stem, base or root form. E.g., the words consulting, consultant and consultative are all stemmed to consult.
Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
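As a toy illustration of suffix-stripping stemming in Java (a handful of hard-coded suffixes, unlike the full Porter-style stemmers real systems use):

```java
import java.util.List;

// Toy suffix-stripping stemmer: maps "consulting", "consultant", "consultative" to "consult".
// Real stemmers (e.g. the Porter algorithm) have many more rules and special cases.
public class ToyStemmer {
    private static final List<String> SUFFIXES = List.of("ative", "ant", "ing", "ed", "s");

    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Only strip when a reasonably long stem remains.
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word; // no known suffix: return the word unchanged
    }

    public static void main(String[] args) {
        for (String w : List.of("consulting", "consultant", "consultative")) {
            System.out.println(w + " -> " + stem(w)); // all print "consult"
        }
    }
}
```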
The behaviour of a web-crawler depends on the outcome of a combination of policies:
○ A selection policy that states which pages to download.
○ A re-visit policy that states when to check pages for changes (the crawl frequency).
○ A parallelization policy that states how to coordinate a parallel crawler (a crawler that runs multiple processes).
Bamrah NHS, Satpute BS, Patil P., 2014. Web Forum Crawling Techniques. International Journal of Computer Applications. 85(36 – 41).
A crawling policy must therefore decide in what order URLs leave the queue, which pages end up represented in the index, and how frequently pages are re-crawled to capture updates.
Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval. 100-111
Breadth-First Search (BFS): all the parent links on a page are followed in sequential order before the crawler follows the child links. Because of this, the crawler needs to save all the parent links of a page in order to follow the child links; hence BFS consumes a lot of memory.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
Depth-First Search (DFS): the crawler picks one parent link and crawls its child links until it reaches the end, and then continues with another parent link. Since it only needs to remember the links on the current path in order to follow the child links, rather than every parent link on a page, it consumes relatively less memory than BFS.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
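The practical difference between the two strategies is simply how the frontier hands out URLs: first-in-first-out gives BFS, last-in-first-out gives DFS. A small Java sketch over a made-up link graph:

```java
import java.util.*;

// BFS vs DFS crawl order on a tiny hypothetical link graph.
public class CrawlOrder {
    // page -> outlinks (made-up structure for illustration)
    static final Map<String, List<String>> LINKS = Map.of(
        "A", List.of("B", "C"),
        "B", List.of("D"),
        "C", List.of("E"),
        "D", List.of(), "E", List.of());

    static List<String> crawl(boolean bfs) {
        Deque<String> frontier = new ArrayDeque<>(List.of("A"));
        Set<String> seen = new HashSet<>(List.of("A"));
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty()) {
            // FIFO (BFS) finishes each level first; LIFO (DFS) dives down one path.
            String page = bfs ? frontier.pollFirst() : frontier.pollLast();
            order.add(page);
            for (String link : LINKS.get(page)) {
                if (seen.add(link)) frontier.addLast(link);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println("BFS: " + crawl(true));   // [A, B, C, D, E]
        System.out.println("DFS: " + crawl(false));  // [A, C, E, B, D]
    }
}
```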
Backlink-count ordering: the page with the highest number of backlinks from previously downloaded pages is downloaded next. A backlink is a link to our site from another website; an outlink is a link to another site from our site. Since the crawler cannot see the whole web, the importance of a page is estimated based on the pages and links acquired so far by the crawler. Popularity measures of this kind are also what search engines use to order pages in their search engine results.
Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval. 100-111
What is Apache Nutch?
Apache Nutch is a highly extensible and scalable open-source web crawler software project hosted by the Apache Software Foundation. It is designed as a flexible, scalable platform for the development of novel Web search engines. Nutch provides: a link-graph builder; schemas for indexing and search; distributed operation, for high scalability; and an extensible, plugin-based architecture.
Nutch is coded entirely in Java, but data is written in language-independent formats. Developers can create plug-ins for media-type parsing, data retrieval, querying and clustering.
Because Nutch works in batch cycles, there is no guarantee that a given page will be fetched / parsed / indexed within X minutes/hours. It nevertheless offers the following advantages over a simple fetcher:
○ A highly scalable and relatively feature-rich crawler
○ Politeness: it obeys robots.txt rules
○ Robustness: it can run on a cluster of machines
○ Quality: crawling can be biased to fetch 'important' pages first
Nutch comes in two branches: Nutch 1.x and Nutch 2.x.
Nutch History
In 2002, the Nutch project was started by Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella. The goal of the Apache Nutch project was to build a search engine system that could index 1 billion pages. Problem faced: such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive. They realized that their project architecture would not be capable of working with billions of pages on the web.
So they were looking for a feasible solution that could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
To meet the storage and processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities have since been spun out into their own subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.
The early concern that Nutch could not scale economically is no longer the case. IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5. The ClueWeb09 dataset (used e.g. in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.
Nutch 1.x vs Nutch 2.x
Nutch 1.x: a well-matured, production-ready crawler relying on Apache Hadoop data structures, which are great for batch processing.
Nutch 2.x: differs from 1.x in one key area: it abstracts the underlying data store by using Apache Gora, which provides a common model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.
Core Components
The Nutch search engine consists, very roughly, of three components:
1. The Crawler, which discovers and retrieves web pages
2. The WebDB, a custom database that stores known URLs and fetched page contents
3. The Indexer, which dissects pages and builds keyword-based indexes from them
NOTE: After the initial creation of an index, it is usual to perform periodic updates of the index in order to keep it up-to-date.
Apart from the above three components, Nutch has a Search Web Application, which is deployed in a servlet container.
https://cwiki.apache.org/confluence/display/NUTCH/Nutch+-+The+Java+Search+Engine
The components of a crawler are as follows:
○ Queue: in its simplest form, a list of the URLs left to fetch; usually it's more advanced and consists of host-based queues, a way to prioritize fetching of more important URLs, an ability to store parts or all of the data structures on disk, and so on.
○ Fetcher: downloads one unit of content at a time, for example one single HTML page.
○ Parser: extracts links and other information from an HTML page. The newly discovered URLs are then normalized and queued to be fetched, while the fetched content is processed and finally indexed.
Components of Nutch
CrawlDB: The Crawl Database is a data store where Nutch stores every URL, together with the metadata that it knows about it. Physically it is a Hadoop sequence file (meaning all records are stored in a sequential manner) consisting of tuples of URL and CrawlDatum. Many other data structures in Nutch are of similar structure, and no relational databases are used. The main reason behind this kind of data structure is scalability: the model of simple, flat data storage works well in a distributed environment.
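As a rough sketch of this flat storage model, the following writes <url, status> tuples with Hadoop's SequenceFile API; the Text status strings stand in for Nutch's real CrawlDatum values, and the output path is arbitrary.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes <url, status> tuples to a flat Hadoop SequenceFile, mimicking the
// CrawlDb layout (real Nutch stores a CrawlDatum, not a Text, as the value).
public class CrawlDbSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("crawldb/part-00000")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new Text("https://example.com/"), new Text("db_unfetched"));
            writer.append(new Text("https://example.org/"), new Text("db_fetched"));
        }
    }
}
```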
Components of Nutch
Segment: a directory containing all the data related to one fetching batch. Besides the Fetch List, the fetched content itself is stored there, in addition to the extracted plain-text version of the content, anchor texts and URLs of outlinks, protocol and document metadata, etc.
Fetch List: a data structure (SequenceFile, URL -> CrawlDatum) that contains the URL/CrawlDatum tuples that are going to be fetched in one batch. Its contents are a subset of the CrawlDB that was created by the generate command, and it is stored inside the Segment.
Components of Nutch
LinkDB: The Link Database is a data structure (SequenceFile, URL -> Inlinks) that contains all inverted links. Nutch can extract outlinks from a document and store them in the format source_url -> target_url, anchor_text. In the process of link inversion we invert the order and combine all instances, making the data records in the Link Database look like target_url -> anchor_text[], so we can use that information later when individual documents are indexed.
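A plain-Java sketch of that inversion, with made-up URLs and anchor texts:

```java
import java.util.*;

// Inverts outlink records (source -> target, anchor) into the LinkDb view:
// target -> list of (source, anchor) pairs.
public class LinkInversion {
    record Outlink(String source, String target, String anchor) {}

    public static void main(String[] args) {
        List<Outlink> outlinks = List.of(               // made-up parse output
            new Outlink("http://a.com", "http://c.com", "great site"),
            new Outlink("http://b.com", "http://c.com", "see also"),
            new Outlink("http://a.com", "http://b.com", "partner"));

        Map<String, List<String>> inlinks = new HashMap<>();
        for (Outlink o : outlinks) {
            inlinks.computeIfAbsent(o.target(), k -> new ArrayList<>())
                   .add(o.source() + " [\"" + o.anchor() + "\"]");
        }
        // http://c.com now lists both pages that link to it, with their anchor texts
        inlinks.forEach((target, in) -> System.out.println(target + " <- " + in));
    }
}
```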
Components of Nutch
Apache projects Nutch delegates work to:
○ Hadoop: distributed execution and data serialization (1.x)
○ Tika: parsing multiple document formats
○ Solr / Elasticsearch: making the crawled content searchable
○ Gora: data storage (2.x)
Important Terminologies
Apache Gora: A framework that provides an in-memory data model and also supports data persistence to different underlying storage systems: databases (like MySQL), column stores (like HBase and Cassandra), key-value stores (like Redis), and even simple files on HDFS.
Apache Tika: Detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Apache Solr: An open-source, REST-API-based enterprise real-time search and analytics engine server.
Important Terminologies
NDFS: The Nutch Distributed File System, later renamed HDFS.
CrawlDB: Stores all information about the URLs/documents that need to be crawled.
LinkDB: Stores the incoming links (URLs and anchor texts).
HBase: An open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
Crawling Algorithm in Apache Nutch:
1. INJECT URLs into the CrawlDb, to bootstrap it
2. GENERATE a set of URLs to fetch from the CrawlDb
3. FETCH a URL into a segment
4. PARSE the fetched content of a segment
5. UPDATE the CrawlDb from the data parsed from the segment
(Repeat steps 2 - 5)
6. INVERT links parsed from segments
7. INDEX segment text and inlink anchor text
MapReduce1: Convert input to CrawlDb format
Input: flat text file of URLs
Map(line) → <url, CrawlDatum>; status = db_unfetched
Reduce() is identity
Output: directory of temporary files

MapReduce2: Merge into existing CrawlDb
Input: output of MapReduce1 and existing CrawlDb files
Map() is identity
Reduce(): merge CrawlDatums into a single entry
Output: new version of the CrawlDb

Step 1: The Injector injects the list of seed URLs into the CrawlDb.
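A hedged Hadoop sketch of the MapReduce1 mapper described above; where real Nutch emits a CrawlDatum object, this simplified version emits the status string db_unfetched as Text:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Inject step, MapReduce1, simplified: each line of the flat seed file becomes
// a <url, status> record with status db_unfetched.
// (Nutch's real inject mapper emits a CrawlDatum, not a Text.)
public class InjectMapperSketch extends Mapper<LongWritable, Text, Text, Text> {

    private static final Text UNFETCHED = new Text("db_unfetched");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (!url.isEmpty() && !url.startsWith("#")) {   // skip blanks and comments
            context.write(new Text(url), UNFETCHED);    // <url, CrawlDatum-like status>
        }
    }
}
```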
MapReduce1: Select URLs due for fetch
Input: CrawlDb files
Map() → if date ≥ now, invert to <CrawlDatum, url>
Partition() by value, hash() to randomize
Reduce(): Compare() orders by decreasing CrawlDatum.linkCount
Output: the top-N most-linked entries (N being the max no. of URLs to fetch)

MapReduce2: Prepare for fetch
Input: output of MapReduce1
Map() is invert
Partition() by host
Reduce() is identity
Output: set of <url, CrawlDatum> files to fetch in parallel

Step 2: The Generator takes the list of URLs due for fetching from the CrawlDb, forms the fetch list, and adds a crawl_generate folder into the segments.
MapReduce: Fetch a set of URLs
Input: <url, CrawlDatum>, partitioned by host, sorted by hash
Map(url, CrawlDatum) → <url, FetcherOutput>
(FetcherOutput: <CrawlDatum, Content>)
The multi-threaded, async map implementation calls the existing Nutch protocol plugins
Reduce() is identity
Output: two files: <url, CrawlDatum> and <url, Content>
The fetcher is a multi-threaded application that employs protocol plugins to retrieve the content of a set of URLs.
Step 3: The fetch lists are used by the fetchers to fetch the raw content of the documents, which is then stored in the segments.
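A plain-Java sketch of the multi-threaded fetch idea, using a thread pool and HttpClient in place of Nutch's per-host fetch queues and protocol plugins; the fetch-list URLs are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.*;

// Fetches a batch of URLs in parallel, as Nutch's multi-threaded fetcher does
// (without its politeness queues and protocol plugins).
public class ParallelFetcher {
    public static void main(String[] args) throws Exception {
        List<String> fetchList = List.of(
            "https://example.com/", "https://example.org/");    // stand-in fetch list
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 fetcher threads

        List<Future<String>> results = pool.invokeAll(
            fetchList.stream().map(url -> (Callable<String>) () -> {
                HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
                return url + " -> " + resp.statusCode();
            }).toList());

        for (Future<String> r : results) System.out.println(r.get());
        pool.shutdown();
    }
}
```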
MapReduce: Parse content
Input: <url, Content> files from the fetch step
Map(url, Content) → <url, Parse>
(Parse: <ParseText, ParseData>)
Calls the existing Nutch parser plugins
Reduce() is identity
Output: split in three: <url, ParseText>, <url, ParseData>, and <url, CrawlDatum> for outlinks
Parser plugins are employed to extract text, links and other metadata from the raw binary content.
Step 4: The Parser is called to parse the content of the documents, and the parsed content is stored back in the segments.
MapReduce: Integrate fetch & parse output into the CrawlDb
Input: the existing CrawlDb plus the fetch & parse output (<url, CrawlDatum> entries)
Map() is identity
Reduce() merges all entries for a URL into a single new entry, summing the count of links from Parse with the previous links from the CrawlDb
Output: new CrawlDb
Step 5: Information about the newly fetched documents is updated in the CrawlDb.
MapReduce: Compute inlinks for all URLs
Input: <url, ParseData> containing page outlinks
Map(source_url, ParseData) → <destination_url, Inlinks>
(Inlinks: <source_url, AnchorText>*)
Collect a single-element Inlinks for each outlink; limit the number of outlinks per page
Reduce() appends inlinks
Output: <url, Inlinks>, a complete link inversion
* AnchorText: the visible, clickable text in an HTML hyperlink
Step 6: The links are inverted in the link graph and stored in the LinkDb.
MapReduce: Create indexes
Input: multiple files, values wrapped in <Class, Object>:
<url, ParseData> from Parse, for title, metadata, etc.
<url, ParseText> from Parse, for text
<url, Inlinks> from Invert, for anchors
<url, CrawlDatum> from Fetch, for the fetch date
Map() is identity
Reduce() creates a Lucene Document and calls the existing Nutch indexing plugins
Output: a Lucene index, copied to the file system at the end
* Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website.
Step 7: The terms present in the segments are indexed, and the indices are updated.
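A minimal Lucene sketch of what the reduce side produces: a Document assembled from the parsed fields and added to an on-disk index. The field names used here are illustrative, not Nutch's actual index schema.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// Builds one Lucene Document per crawled page and writes it to an on-disk index.
// Field names ("url", "title", "content", "anchors") are illustrative only.
public class IndexerSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("url", "https://example.com/", Field.Store.YES));
            doc.add(new TextField("title", "Example Domain", Field.Store.YES));
            doc.add(new TextField("content", "parsed plain text of the page", Field.Store.NO));
            doc.add(new TextField("anchors", "example site", Field.Store.NO)); // from LinkDb
            writer.addDocument(doc);
        }
    }
}
```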
REST means REpresentational State Transfer. REST is a web-standards-based architecture that uses the HTTP protocol. It revolves around resources, where every component is a resource accessed through a common interface using HTTP standard methods. A REST architecture consists of a REST server, which provides access to the resources, and a REST client, which accesses and modifies the REST resources. Resources can be represented in various formats, such as XML, JSON etc. Because REST services are stateless, they can scale to accommodate load changes.
Architecture
Source: https://www.slideshare.net/AniruddhBhilvare/an-introduction-to-rest-api
HTTP Methods
The PUT, GET, POST and DELETE methods are typically used in REST-based architectures:
○ GET: read/retrieve a resource
○ POST: create a new resource
○ PUT: update or replace an existing resource
○ DELETE: remove a resource
Refer to - https://cwiki.apache.org/confluence/display/NUTCH/NutchRESTAPI
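For example, after starting the server with bin/nutch startserver (which listens on port 8081 by default), a client can call the API with plain HTTP; the /admin status endpoint used below is taken from the wiki page referenced above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Queries a locally running Nutch REST server for its status.
// Assumes the server was started with: bin/nutch startserver (default port 8081).
public class NutchRestClient {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8081/admin"))  // status endpoint from the wiki
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());   // status of the Nutch server
    }
}
```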
All components listed above use the Nutch API. Users can utilize the API via two approaches, depending on the task at hand:
1. Through the Nutch shell script, for administrative tasks such as creating and maintaining indexes
2. Through the Search Web Application, in order to perform a search using keywords
Refer to - https://www.youtube.com/watch?v=AvyBiGuBc64
https://www.youtube.com/embed/AvyBiGuBc64?start=1020&end=1086
https://www.youtube.com/embed/AvyBiGuBc64?start=1092&end=1152
○ bin/nutch inject urls (inject the seed URLs from the urls directory into the CrawlDb)
○ bin/nutch generate -topN 1000 (will give us the top 1000 URLs with the most inlinks)
○ bin/nutch fetch -all (will fetch all the URLs generated in the previous step)
○ bin/nutch updatedb -all (update the CrawlDb with the newly fetched URLs)
○ Finally, index the fetched content using Elasticsearch or Solr, etc.
Nutch is a great web-search system because:
○ Mature, business-friendly licence and community
○ Tried and tested on a very large scale
○ Hadoop cluster: existing installation and skills can be reused
○ Can index with Solr, Elasticsearch, etc.
○ Has a PageRank implementation
○ Can be extended with plugins
However, there are a few issues in Nutch, in both 1.x (restarting in case of a system crash) and 2.x (unstable), that need to be worked on.