SLIDE 1

CRAWLING WITH APACHE NUTCH

Ashish Kumar Sinha, Deeksha Kushal Motwani, Shailender Joseph

SLIDE 2

CONTENTS

  • Web-Crawling
  • Apache Nutch
  • Crawling Algorithm in Apache Nutch
  • REST API
  • Demo
  • Conclusion

SLIDE 3

Web-Crawling

SLIDE 4

What is Web-Crawling?

  • Web-Crawling is a process by which search engine crawlers/spiders/bots scan a website and collect details about each page: titles, images, keywords, other linked pages, etc.
  • It also discovers updated content on the web, such as new sites or pages, changes to existing sites, and dead links.

According to Google “The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As our crawlers visit these websites, they use links on those sites to discover other pages.”

  • http://www.digitalgenx.com/learn/crawling-in-seo.php
SLIDE 5

Web-Crawling

  • Crawlers are widely used by search engines like Google, Yahoo or Bing to retrieve the contents of a URL, examine that page for other links, retrieve the URLs for those links, and so on.
  • Google was the first company to publish its web-crawler, which consisted of two programs: a spider and a mite.
  • The spider maintains the seeds, while the mite is responsible for downloading webpages.
  • Googlebot and Bingbot are the most popular spiders, owned by Google and Bing respectively.

https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y

SLIDE 6

Web-Crawling vs Web-Scraping

  • Web-scraping is closely related to web-crawling, but it is a different technique.
  • The main purpose of a web-scraper is to convert unstructured data found on the internet into a structured format for analysis or later reference.
  • Web-scraping (like web-crawling) often has the ability to browse different pages and follow links.
  • But (unlike web-crawling) its primary purpose is extracting the data on those pages, not indexing the web.

https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y

SLIDE 7

Process Flow of a Sequential Web-Crawler

Popular open source web-crawlers:

  • Scrapy: a Python-based web-crawling framework
  • Heritrix: a Java-based web-crawler designed for web-archiving, written by the Internet Archive
  • HTTrack: a C-based web-crawler, developed by Xavier Roche
  • Apache Nutch

https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf

SLIDE 8

Frontier Initialization

  • A crawl frontier is a data structure used to store URLs eligible for crawling and to support such operations as adding URLs and selecting URLs for crawl (it can be seen as a priority queue).
  • The initial list of URLs contained in the crawl frontier are known as seeds: crawling "seeds" are the pages at which a crawler commences.
  • Seeds should be selected carefully, and multiple seeds may be necessary to ensure good coverage.

https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf

SLIDE 9

More on Frontier

  • The web-crawler constantly asks the frontier what pages to visit.
  • As the crawler visits each of those pages, it informs the frontier of the response for each page.
  • The crawler also updates the frontier with any new hyperlinks contained in the pages it has visited.
  • These hyperlinks are added to the frontier, and the crawler visits those new webpages based on the policies of the frontier.

Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
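
Since the deck repeatedly describes the frontier as a priority queue, a minimal Java sketch may help; all class and field names here are illustrative, not Nutch's actual data structures:

    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.PriorityQueue;
    import java.util.Set;

    // Illustrative frontier: a priority queue of URLs, most-linked first.
    public class Frontier {
        // A candidate URL with a crude priority: its known inlink count.
        static class Candidate {
            final String url;
            final int inlinkCount;
            Candidate(String url, int inlinkCount) {
                this.url = url;
                this.inlinkCount = inlinkCount;
            }
        }

        // Highest inlink count first.
        private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
            Comparator.comparingInt((Candidate c) -> c.inlinkCount).reversed());
        // Track scheduled URLs so a revisit is an explicit policy decision.
        private final Set<String> seen = new HashSet<>();

        // Seeds and newly discovered URLs enter here.
        public void add(String url, int inlinkCount) {
            if (seen.add(url)) {
                queue.add(new Candidate(url, inlinkCount));
            }
        }

        // The crawler repeatedly asks the frontier for the next page to visit.
        public String next() {
            Candidate c = queue.poll();
            return c == null ? null : c.url;
        }

        public static void main(String[] args) {
            Frontier f = new Frontier();
            f.add("http://seed.example/", 0);
            f.add("http://popular.example/", 10);
            System.out.println(f.next()); // http://popular.example/ comes first
        }
    }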

SLIDE 10

Fetching

  • The fetcher is a multi-threaded application (capable of processing more than one task in parallel) that employs protocol plugins to retrieve the content of a set of URLs.
  • The protocol plugins collect information about the network protocols supported by the system.
  • A plug-in is a software component that adds a specific feature to an existing computer program.
  • Network protocols are formal standards and policies, made up of rules, procedures and formats, that define communication between two or more devices over a network. They conduct the actions, policies, and affairs of the end-to-end process of timely, secure and managed data or network communication.
  • Fetching is analogous to downloading a page (similar to what a browser does when you view the page).

Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
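
As a rough illustration of the multi-threaded idea only (Nutch's real fetcher adds politeness, robots.txt handling and protocol plugins on top of this), the sketch below fetches a set of URLs in parallel with a thread pool, using Java 11's built-in HTTP client:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Toy multi-threaded fetcher (Java 11+): each worker thread downloads one
    // URL, which is what "processing more than one task in parallel" amounts to.
    public class ToyFetcher {
        public static void main(String[] args) {
            List<String> urls = List.of(
                "https://nutch.apache.org/",
                "https://hadoop.apache.org/");
            HttpClient client = HttpClient.newHttpClient();
            ExecutorService pool = Executors.newFixedThreadPool(4); // fetch threads
            for (String url : urls) {
                pool.submit(() -> {
                    try {
                        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
                        HttpResponse<String> resp =
                            client.send(req, HttpResponse.BodyHandlers.ofString());
                        System.out.println(url + " -> HTTP " + resp.statusCode()
                            + ", " + resp.body().length() + " chars");
                    } catch (Exception e) {
                        System.err.println(url + " failed: " + e.getMessage());
                    }
                });
            }
            pool.shutdown(); // let the submitted fetches finish
        }
    }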

SLIDE 11

Parsing

Involves one or more of the following:

  • Simple hyperlink/URL extraction
  • Tidying up the HTML content in order to analyze the HTML tag tree
  • The HTML tag tree is the hierarchical representation of the HTML page in the form of a tree structure.
  • Converting the extracted URLs to a canonical form, removing stopwords from the page's content, and stemming the remaining words.
  • Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. E.g.: the words consulting, consultant and consultative are all stemmed to consult.

Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
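
As a hedged sketch of two of these steps, the helpers below show URL canonicalization and a deliberately naive suffix-stripping stemmer (real systems use proper stemmers such as Porter's):

    import java.net.URI;
    import java.net.URISyntaxException;

    // Illustrative parsing helpers: canonicalize a URL and stem a word.
    // Both are simplified stand-ins for what parser plugins and analyzers do.
    public class ParseHelpers {

        // Canonical form: normalize the path, lowercase scheme and host,
        // drop the fragment.
        static String canonicalize(String url) throws URISyntaxException {
            URI u = new URI(url).normalize();
            String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            return u.getScheme().toLowerCase() + "://" + host + path;
        }

        // Naive stemmer: strips a few common suffixes so that
        // consulting / consultant / consultative all become "consult".
        static String stem(String word) {
            for (String suffix : new String[] {"ative", "ant", "ing", "ed", "s"}) {
                if (word.endsWith(suffix) && word.length() > suffix.length() + 3) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }

        public static void main(String[] args) throws URISyntaxException {
            System.out.println(canonicalize("HTTP://Nutch.Apache.ORG/docs/../index.html#top"));
            // http://nutch.apache.org/index.html
            System.out.println(stem("consulting")); // consult
        }
    }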

SLIDE 12

Web-Crawling Policies

The behaviour of a web-crawler depends on the outcome of a combination of policies:

  • Selection Policy:
  • It is used to download the appropriate pages, that is, it states which pages to download.
  • Re-visit Policy:
  • It states when to check for changes to the pages.
  • Two simple (or naive) re-visiting policies:
  • Uniform policy: all pages in the collection are re-visited with the same frequency.
  • Proportional policy: the pages that change more frequently are re-visited more often. There are quantitative methods to measure the visiting frequency.
  • Politeness Policy:
  • It states how to avoid overloading websites.
  • Parallelization Policy:
  • It states how to avoid repeated downloads of the same page when running a parallel crawler (a crawler that runs multiple processes).

Bamrah NHS, Satpute BS, Patil P., 2014. Web Forum Crawling Techniques. International Journal of Computer Applications. 85(36 – 41).

SLIDE 13

Crawl Ordering Policy (or Crawl Strategy)

  • A strategy for a crawler to choose URLs from a crawling queue.
  • It is related to one of the following two main tasks:
  • Downloading newly discovered webpages not represented in the index
  • Refreshing copies of pages likely to have important updates

Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval. 100-111

SLIDE 14

Crawl Ordering Policy

  • Breadth-first search:
  • A technique where all the links in a page are followed in sequential order before the crawler follows the child links.
  • Since child links can only be generated from the parent links, the crawler needs to save all the parent links in a page in order to follow the child links. Hence, it consumes a lot of memory.

https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y

SLIDE 15
Crawl Ordering Policy

  • Depth-first search:
  • An algorithm where the crawler starts with the parent link and crawls the child links until it reaches the end, and then continues with another parent link.
  • Since it does not have to save all the parent links in a page, it consumes relatively less memory than BFS. The sketch below contrasts the two orders.

https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
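
The memory trade-off comes down to the data structure holding discovered links: a FIFO queue yields breadth-first order, a stack yields depth-first. A toy sketch over a hard-coded link graph (no real fetching):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Crawl ordering on a toy link graph: BFS and DFS differ only in
    // whether discovered links wait in a FIFO queue or on a stack.
    public class CrawlOrder {
        static final Map<String, List<String>> LINKS = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"),
            "C", List.of("D"),
            "D", List.of());

        static List<String> crawl(String seed, boolean breadthFirst) {
            Deque<String> frontier = new ArrayDeque<>();
            Set<String> visited = new LinkedHashSet<>(); // keeps visit order
            frontier.add(seed);
            while (!frontier.isEmpty()) {
                // Queue behaviour (BFS) takes the oldest URL; stack (DFS) the newest.
                String url = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
                if (!visited.add(url)) continue; // already crawled
                frontier.addAll(LINKS.getOrDefault(url, List.of()));
            }
            return new ArrayList<>(visited);
        }

        public static void main(String[] args) {
            System.out.println("BFS: " + crawl("A", true));  // [A, B, C, D]
            System.out.println("DFS: " + crawl("A", false)); // [A, C, D, B]
        }
    }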

SLIDE 16
Crawl Ordering Policy

  • Prioritize by indegree:
  • The page with the highest number of incoming hyperlinks from previously downloaded pages is downloaded next.
  • Incoming links are links that come to our website from another website.
  • Outgoing links are links that go from our site to another site.
  • Prioritize by PageRank:
  • Pages are downloaded in descending order of PageRank, as estimated based on the pages and links acquired so far by the crawler.
  • PageRank is an algorithm used by Google Search to rank webpages in its search engine results.

Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval. 100-111
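
To make the PageRank policy concrete, here is a small power-iteration sketch over a toy three-page graph; the damping factor 0.85 follows the original PageRank paper, while the graph and iteration count are illustrative:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy power-iteration PageRank. A crawler using the "prioritize by
    // PageRank" policy would fetch pages in descending order of these scores.
    public class ToyPageRank {
        public static void main(String[] args) {
            Map<String, List<String>> out = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("C"),
                "C", List.of("A"));
            double d = 0.85;  // damping factor
            int n = out.size();
            Map<String, Double> rank = new HashMap<>();
            for (String p : out.keySet()) rank.put(p, 1.0 / n); // uniform start
            for (int iter = 0; iter < 50; iter++) {
                Map<String, Double> next = new HashMap<>();
                for (String p : out.keySet()) next.put(p, (1 - d) / n);
                // Each page distributes its rank evenly over its outlinks.
                for (Map.Entry<String, List<String>> e : out.entrySet()) {
                    double share = rank.get(e.getKey()) / e.getValue().size();
                    for (String target : e.getValue()) {
                        next.merge(target, d * share, Double::sum);
                    }
                }
                rank.clear();
                rank.putAll(next);
            }
            rank.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .forEach(e -> System.out.printf("%s %.3f%n", e.getKey(), e.getValue()));
        }
    }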

SLIDE 17

Apache Nutch

SLIDE 18

What is Apache Nutch

  • Apache Nutch is a highly extensible and scalable open source web crawler software project hosted by the Apache Software Foundation.
  • It provides a complete, high-quality Web search system, as well as a flexible, scalable platform for the development of novel Web search engines.
  • Nutch includes: a Web crawler; parsers for Web content; a link-graph builder; schemas for indexing and search; distributed operation, for high scalability; and an extensible, plugin-based architecture.
  • Nutch is coded entirely in the Java programming language, but data is written in language-independent formats.
  • It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.

SLIDE 19
  • Runs on top of Hadoop
  • Scalable: billions of pages are possible
  • Some overhead (if scale is not a requirement)
  • Not ideal for low latency (Hadoop-based == batch processing == high latency), i.e. no guarantee that a page will be fetched/parsed/indexed within X minutes/hours
  • Customizable / extensible plugin architecture
  • Pluggable protocols (document access)
  • URL filters + normalizers
  • The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project. Nutch has the following advantages over a simple fetcher:
  • 1. Highly scalable and relatively feature-rich crawler.
  • 2. Features like politeness, which obeys robots.txt rules.
  • 3. Robust and scalable: Nutch can run on a cluster of up to 100 machines.
  • 4. Quality: crawling can be biased to fetch "important" pages first.
  • Stemming from Apache Lucene, the project has diversified and now comprises two codebases, i.e. Nutch 1.x and Nutch 2.x.

SLIDE 20

Nutch History

In 2002, the Nutch project was started by Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella. The Apache Nutch project set out to build a search engine system that could index 1 billion pages. Problem faced: the system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive. So they realized that their project architecture would not be capable of working with billions of pages on the web.

So they were looking for a feasible solution which could reduce the implementation cost as well as solve the problem of storing and processing large datasets.

SLIDE 21
  • In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were later spun out into their own subproject, called Hadoop.
  • 2004/05: MapReduce and distributed file system in Nutch
  • In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year.
  • 2006: Hadoop split from Nutch; Nutch based on Hadoop (sub-project of Lucene@Apache)
  • 2007: use Tika for MimeType detection; Tika Parser in 2010
  • 2008: start NutchBase, …, 2012: released 2.0 based on Gora 0.2
  • 2009: use Solr for indexing
  • Since April 2010, Nutch has been considered an independent, top-level project of the Apache Software Foundation.

SLIDE 22
  • June 2012: Nutch 1.5.1
  • Oct 2012: Nutch 2.1
  • In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl.
  • While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case. IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5. The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.

SLIDE 23

Nutch 1.x Vs Nutch 2.x

Nutch 1.x:

  • A well matured, production-ready crawler.
  • 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing.
  • Need to restart the crawl in case of failure.

Nutch 2.x:

  • An emerging alternative, but not yet stable.
  • Storage is abstracted away from any specific underlying data store by using Apache Gora.
  • Can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

SLIDE 24

Core Components of Nutch

The Nutch search engine consists, very roughly, of 3 components:

1. The Crawler, which discovers and retrieves web pages

2. The 'WebDB', a custom database that stores known URLs and fetched page contents

3. The 'Indexer', which dissects pages and builds keyword-based indexes from them

NOTE: After the initial creation of an index, it is usual to perform periodic updates of the index in order to keep it up-to-date.

Apart from the above three components, Nutch has a Search Web Application. This application is a JSP application that can be configured and deployed in a servlet container.

https://cwiki.apache.org/confluence/display/NUTCH/Nutch+-+The+Java+Search+Engine

SLIDE 25

Components of Crawlers are as follows:

  • Queue: The queue contains URLs to be fetched. It may be a simple memory-based, first-in-first-out queue, but usually it's more advanced and consists of host-based queues, a way to prioritize fetching of more important URLs, an ability to store parts or all of the data structures on disk, and so on.
  • Fetcher: The fetcher is a component that does the actual work of getting a single piece of content, for example one single HTML page.
  • Extractor: The extractor is a component responsible for finding new URLs to fetch, for example by extracting that information from an HTML page. The newly discovered URLs are then normalized and queued to be fetched.
  • Content Repository: The content repository is a place where you store the content. Later the content is processed and finally indexed.

SLIDE 26

Components of Nutch

  • CrawlDB: The Crawl Database is a data store where Nutch stores every URL, together with the metadata that it knows about it. In Hadoop terms it's a Sequence file (meaning all records are stored in a sequential manner) consisting of tuples of URL and CrawlDatum. Many other data structures in Nutch have a similar structure, and no relational databases are used. The rationale behind this kind of data structure is scalability. The model of simple, flat data storage works well in a distributed environment.

SLIDE 27

Components of Nutch

  • A Segment is basically a folder containing all the data related to one fetching batch. Besides the Fetch List, the fetched content itself is stored there, in addition to the extracted plain text version of the content, anchor texts and URLs of outlinks, protocol and document level metadata, etc.
  • The Fetch List is a data structure (SequenceFile, URL -> CrawlDatum) that contains the URL, CrawlDatum tuples that are going to be fetched in one batch. Usually the Fetch List contents are a subset of the CrawlDB that was created by the generate command. The Fetch List is stored inside the Segment.

SLIDE 28

Components of Nutch

  • The Link Database is a data structure (Sequence file, URL -> Inlinks) that contains all inverted links. In the parsing phase Nutch can extract outlinks from a document and store them in the format source_url -> target_url, anchor_text. In the process of inversion we invert the order and combine all instances, making the data records in the Link Database look like: target_url -> anchor_text[], so we can use that information later when individual documents are indexed.

SLIDE 29

Components of Nutch

Apache projects Nutch delegates work to:

  • Hadoop: scalability, job execution, data serialization (1.x)
  • Tika: detecting and parsing multiple document formats
  • Solr, ElasticSearch: make crawled content searchable
  • Gora (and HBase, Cassandra, …): data storage (2.x)
  • crawler-commons: robots.txt parsing

SLIDE 30

Important Terminologies

A framework that provides an in-memory data model that also supports data persistence to different underlying storage systems: databases (like MySQL), column stores (like HBase and Cassandra), key-value stores (like Redis), and even simple files on HDFS.

Apache Gora

Detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Apache Tika

Open-source REST-API based Enterprise Real-time Search and Analytics Engine Server.

SolR

SLIDE 31

Important Terminologies

  • NDFS: Renamed as HDFS.
  • CrawlDB: Stores all information about the URLs/documents which need to be crawled.
  • LinkDB: Stores the incoming links (URLs and anchor texts).
  • HBase: An open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.

SLIDE 32

Crawling Algorithm In Apache Nutch

SLIDE 33

Crawling In Apache Nutch

  • Apache Nutch uses MapReduce for reliable and scalable computing.
  • The crawling state is maintained in a data structure, the CrawlDb:
  • CrawlDb: directory of files containing <URL, CrawlDatum>
  • CrawlDatum: <status, date, interval, failures, linkCount, ...>
  • Status: {db_unfetched, db_fetched, db_gone, linked, fetch_success, fetch_fail, fetch_gone}
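
The record layout above can be paraphrased in Java (Java 16+ records; a sketch only, since the real CrawlDatum is a Hadoop Writable with more fields and its own serialization):

    // Sketch of the CrawlDb record layout from the slide; names are
    // illustrative, not Nutch's actual classes.
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE, LINKED,
                  FETCH_SUCCESS, FETCH_FAIL, FETCH_GONE }

    record CrawlDatum(Status status, long fetchTimeMillis,
                      int fetchIntervalSecs, int failures, int linkCount) { }

    // Conceptually the CrawlDb is then a collection of <URL, CrawlDatum>
    // pairs, stored on disk as files rather than held in memory.
    record CrawlDbEntry(String url, CrawlDatum datum) { }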
SLIDE 34

CRAWLING ALGORITHM

  • 1. INJECT URLs into the CrawlDb, to bootstrap it.
  • 2. GENERATE a set of URLs to fetch from the CrawlDb.
  • 3. FETCH a URL into a segment.
  • 4. PARSE the fetched content of a segment.
  • 5. UPDATE the CrawlDb from data parsed from the segment.
  • 6. REPEAT steps 2 - 5.
  • 7. INVERT links parsed from segments.
  • 8. INDEX segment text and inlink anchor text.

SLIDE 35

Crawling Algorithm

SLIDE 36

INJECT

MapReduce1: Convert input to CrawlDb format
Input: flat text file of URLs
Map(line) → <url, CrawlDatum>; status=db_unfetched
Reduce() is identity
Output: directory of temporary files

SLIDE 37

INJECT

MapReduce2: Merge into the existing CrawlDb
Input: output of MapReduce1 and existing CrawlDb files
Map() is identity
Reduce: merge CrawlDatums into a single entry
Output: new version of the CrawlDb
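
A hedged sketch of the first inject pass as a Hadoop mapper; the real Injector lives in org.apache.nutch.crawl and emits CrawlDatum writables, for which Text stands in here:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Inject, map phase: one line of the flat seed file in, one
    // <url, status> pair out. Text stands in for Nutch's CrawlDatum.
    public class InjectMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String url = line.toString().trim();
            if (!url.isEmpty() && !url.startsWith("#")) { // skip blanks and comments
                context.write(new Text(url), new Text("db_unfetched"));
            }
        }
    }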

SLIDE 38

Step 1: The Injector injects the list of seed URLs into the CrawlDb.

SLIDE 39

GENERATE

MapReduce1: Select URLs due for fetch
Input: CrawlDb files
Map() → if date ≥ now, invert to <CrawlDatum, url>
Partition() by value, hash() to randomize
Reduce: Compare() orders by decreasing CrawlDatum.linkCount
Output: top-N most-linked entries

CrawlDatum.linkCount:
  • the count of links pointing to that URL

Top-N most-linked entries:
  • the URLs which have the most inlinks (N being the max number of URLs to fetch)
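
Stripped of MapReduce, the generate step reduces to "keep the URLs that are due, order by link count, take the top N"; a sketch with illustrative types:

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // The generate selection outside MapReduce: filter URLs that are due,
    // sort by inlink count, keep the top N. The Entry record is illustrative.
    public class ToyGenerate {
        record Entry(String url, long dueMillis, int linkCount) { }

        static List<String> generate(List<Entry> crawlDb, long now, int topN) {
            return crawlDb.stream()
                .filter(e -> e.dueMillis() <= now)                            // due for fetch
                .sorted(Comparator.comparingInt(Entry::linkCount).reversed()) // most linked first
                .limit(topN)
                .map(Entry::url)
                .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Entry> db = List.of(
                new Entry("http://a.example/", 0L, 12),
                new Entry("http://b.example/", 0L, 40),
                new Entry("http://c.example/", Long.MAX_VALUE, 99)); // not due yet
            System.out.println(generate(db, System.currentTimeMillis(), 2));
            // [http://b.example/, http://a.example/]
        }
    }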

SLIDE 40

Step 2: The Generator takes the list of seed URLs from the CrawlDb, forms the fetch list, and adds a crawl_generate folder into the segments.

SLIDE 41

MapReduce2: Prepare for fetch

Input: output of MapReduce1
Map() is invert
Partition() by host
Reduce() is identity
Output: set of <url, CrawlDatum> files to fetch in parallel

SLIDE 42

FETCH

MapReduce: Fetch a set of URLs
Input: <url, CrawlDatum>
Partition by host, sort by hash
Map(url, CrawlDatum) → <url, FetcherOutput>
  (FetcherOutput: <CrawlDatum, Content>)
Multi-threaded, async map implementation calls existing Nutch protocol plugins
Reduce() is identity
Output: two files: <url, CrawlDatum>, <url, Content>

The fetcher is a multi-threaded application that employs protocol plugins to retrieve the content of a set of URLs.

SLIDE 43

Step 3: These fetch lists are used by fetchers to fetch the raw content of the document. It is then stored in segments.

SLIDE 44

PARSE

MapReduce: Parse content
Input: <url, Content> files from Fetch
Map(url, Content) → <url, Parse>
  (Parse: <ParseText, ParseData>)
Calls existing Nutch parser plugins
Reduce() is identity
Output: split in three: <url, ParseText>, <url, ParseData> and <url, CrawlDatum> for outlinks

Parser plugins are employed to extract text, links and other metadata from the raw binary content.

SLIDE 45

Step 4: The Parser is called to parse the content of the document, and the parsed content is stored back in segments.

SLIDE 46

Update CrawlDb

MapReduce: Integrate Fetch & Parse output into the CrawlDb
Input: <url, CrawlDatum> existing database plus Fetch & Parse output
Map() is identity
Reduce() merges all entries into a single new entry:
  • overwrite the previous CrawlDb status with the new status from Fetch
  • sum the count of links from Parse with the previous links from the CrawlDb
Output: new CrawlDb
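
The reduce step's merge rule can be sketched as a single function, reusing the illustrative CrawlDatum record from the CrawlDb sketch earlier: the status from Fetch wins, and link counts accumulate:

    // Merge rule for updatedb, using the illustrative CrawlDatum record
    // defined in the earlier CrawlDb sketch.
    public class ToyUpdateDb {
        static CrawlDatum merge(CrawlDatum previous, CrawlDatum fromFetch) {
            return new CrawlDatum(
                fromFetch.status(),              // overwrite status with the Fetch result
                fromFetch.fetchTimeMillis(),
                fromFetch.fetchIntervalSecs(),
                fromFetch.failures(),
                previous.linkCount() + fromFetch.linkCount()); // sum old and new links
        }
    }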

SLIDE 47

Step 5: The links are inverted in the link graph and stored in the LinkDb.

SLIDE 48

INVERT LINKS

MapReduce: Compute inlinks for all URLs
Input: <url, ParseData> containing page outlinks
Map(source_url, ParseData) → <destination_url, Inlinks>
  (Inlinks: <source_url, AnchorText>*)
Collect a single-element Inlinks for each outlink
Limit the number of outlinks per page
Reduce() appends inlinks
Output: <url, Inlinks> - a complete link inversion

* AnchorText (link text) is the visible, clickable text in an HTML hyperlink.
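
Link inversion in miniature, with illustrative types: outlink records (source, target, anchor) become a map from each target URL to the list of its inlinks with their anchor texts:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Invert outlinks (source -> target, anchor) into inlinks
    // (target -> [(source, anchor), ...]), ready for indexing.
    public class ToyLinkInvert {
        record Outlink(String source, String target, String anchor) { }

        static Map<String, List<String>> invert(List<Outlink> outlinks) {
            Map<String, List<String>> inlinks = new HashMap<>();
            for (Outlink o : outlinks) {
                inlinks.computeIfAbsent(o.target(), k -> new ArrayList<>())
                       .add(o.source() + " \"" + o.anchor() + "\"");
            }
            return inlinks;
        }

        public static void main(String[] args) {
            System.out.println(invert(List.of(
                new Outlink("a.com", "c.com", "see C"),
                new Outlink("b.com", "c.com", "C here"))));
            // {c.com=[a.com "see C", b.com "C here"]}
        }
    }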

SLIDE 49

Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.

SLIDE 50

INDEX

MapReduce: Create indexes

Input: multiple files, values wrapped in <Class, Object>:
  • <url, ParseData> from Parse, for title, metadata, etc.
  • <url, ParseText> from Parse, for text
  • <url, Inlinks> from Invert, for anchors
  • <url, CrawlDatum> from Fetch, for fetch date
Map() is identity
Reduce() creates a Lucene Document, calling existing Nutch indexing plugins
Output: build Lucene index; copy to the file system at the end

* Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website.
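
A minimal Lucene sketch, assuming Lucene 8+ on the classpath: index one crawled "page" with url, title and content fields, roughly the Document that the reduce step above builds per crawled page:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.ByteBuffersDirectory;

    // Index a single page into an in-memory Lucene index.
    public class ToyIndex {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    new ByteBuffersDirectory(),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // url is stored verbatim; title and content are analyzed.
                doc.add(new StringField("url", "https://nutch.apache.org/", Field.Store.YES));
                doc.add(new TextField("title", "Apache Nutch", Field.Store.YES));
                doc.add(new TextField("content",
                    "Nutch is an extensible, scalable web crawler", Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }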

SLIDE 51

Step 7: Information on the newly fetched documents is updated in the CrawlDb.

SLIDE 52

REST API

SLIDE 53

What is it?

REST means REpresentational State Transfer

SLIDE 54

REpresentational? State? Transfer?

  • It represents the state of a database at a time. But how?
  • REST is an architectural style which is based on web standards and the HTTP protocol.
  • In a REST-based architecture everything is a Resource.
  • A resource is accessed via a common interface based on the standard HTTP methods.
  • You typically have a REST server which provides access to the resources, and a REST client which accesses and modifies the REST resources.

SLIDE 55

REpresentational? State? Transfer?

  • Every resource should support the common HTTP operations.
  • Resources are identified by global IDs (which are typically URIs or URLs).
  • REST allows resources to have different representations, e.g., text, XML, JSON, etc.
  • Stateless in nature; excellent for distributed systems.
  • Stateless components can be freely redeployed if something fails, and they can scale to accommodate load changes.
  • This is because any request can be directed to any instance of a component.

SLIDE 56

Architecture

Source: https://www.slideshare.net/AniruddhBhilvare/an-introduction-to-rest-api

SLIDE 57

HTTP Methods

The PUT, GET, POST and DELETE methods are typically used in REST based architectures. In brief:

  • GET: read a resource, without side effects
  • POST: create a new resource
  • PUT: create or replace a resource at a known URI
  • DELETE: remove a resource
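
For concreteness, a minimal Java client exercising GET and POST; the host, port, endpoint and JSON body are placeholders, not Nutch's actual REST API (see the wiki link on the next slide for that):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // GET reads a resource; POST creates one. The URLs and payload below
    // are illustrative placeholders only.
    public class RestDemo {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            HttpRequest get = HttpRequest.newBuilder(
                URI.create("http://localhost:8081/resource/42")).GET().build();
            System.out.println("GET -> "
                + client.send(get, HttpResponse.BodyHandlers.ofString()).statusCode());

            HttpRequest post = HttpRequest.newBuilder(
                    URI.create("http://localhost:8081/resource"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"name\":\"example\"}"))
                .build();
            System.out.println("POST -> "
                + client.send(post, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }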

SLIDE 58

Nutch 2.x REST API

Refer to - https://cwiki.apache.org/confluence/display/NUTCH/NutchRESTAPI

All components listed above use the Nutch API. Users can utilize the API via two approaches, depending on the task at hand:

1. Through the Nutch shell script, for administrative tasks such as creating and maintaining indexes
2. Through the Search Web Application, in order to perform a search using keywords

SLIDE 59

STEP BY STEP DEMO OF Nutch 2.x

Refer to - https://www.youtube.com/watch?v=AvyBiGuBc64

SLIDE 60

Installing HBase and Nutch 2.3

SLIDE 61

Configuring HBase by setting the root directory for HBase and ZooKeeper

SLIDE 62

Starting HBase

SLIDE 63

Starting HBase

  • Changing gora.properties, telling Nutch that we are using HBase as its database - https://www.youtube.com/embed/AvyBiGuBc64?start=1020&end=1086
  • Adding the HBase dependency to Nutch by updating the file ivy.xml - https://www.youtube.com/embed/AvyBiGuBc64?start=1092&end=1152
  • Compiling Nutch - running the command "ant runtime" in the Nutch root directory
SLIDE 64

Adding Crawler properties

  • Agent Name
  • Stating the storage
  • Stating the plugins to use
SLIDE 65

Adding Crawler properties

  • Seed with some domain name.
  • Inject the seeded URLs in the runtime local folder with the following command: bin/nutch inject urls
  • Next we fetch the URLs in two steps:
    ○ bin/nutch generate -topN 1000 (gives us the top 1000 URLs with the most inlinks)
    ○ bin/nutch fetch -all (fetches all the URLs generated in the previous step)
  • We now parse the fetched URLs using: bin/nutch parse -all
  • And finally we update our database for the next iteration: bin/nutch updatedb -all
  • We reiterate through the fetch, parse and updatedb steps to get more URLs.
  • And finally we index the crawled data into a database-like structure using ElasticSearch, Solr, etc.


SLIDE 67

Conclusions

Nutch is a great web-search system because:

  • Usual reasons: mature, business-friendly licence, community
  • Scalability: tried and tested on a very large scale; Hadoop cluster installation and skills
  • Features: can index with Solr, ElasticSearch, etc.; has a PageRank implementation; can be extended with plugins

However, there are a few issues in Nutch, in both 1.x (restarting in case of a system crash) and 2.x (unstable), that need to be worked on.