CRAWLING WITH APACHE NUTCH
Ashish Kumar Sinha, Deeksha Kushal Motwani, Shailender Joseph
CONTENTS
Web-Crawling
Apache Nutch
Crawling Algorithm in Apache Nutch
REST API
Demo
Conclusion
Web-Crawling
Web-crawling is a process by which crawlers/spiders/bots scan a website and collect details about each page: titles, images, keywords, and other linked pages. They also record new sites or pages, changes to existing sites, and dead links.
According to Google, “The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As our crawlers visit these websites, they use links on those sites to discover other pages.”
Web-crawling is used by search engines like Google, Yahoo or Bing to retrieve the contents of a URL, examine that page for other links, retrieve the URLs for those links, and so on.
An early web crawler consisted of two programs, spider and mite: the spider maintained the queue of URLs to visit, while mite was responsible for downloading webpages.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
Web-scraping is a different technique: it converts unstructured data found on the internet into a structured format for analyzing or for later reference. Like crawlers, scrapers have the ability to browse different pages and follow links, but they focus on extracting the data on those pages, not on indexing the web.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
Process Flow of a Sequential Web-Crawler
Popular open-source web-crawlers include Heritrix, which is designed for web-archiving, and Apache Nutch.
https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf
The crawl frontier is the data structure that stores the URLs eligible for crawling and supports such operations as adding new URLs and selecting the next URL to fetch (it can be seen as a priority queue). The URLs initially placed in the frontier are known as seeds: the addresses from which the crawl commences. A good selection of seeds may be necessary to ensure good coverage.
https://homepage.divms.uiowa.edu/~psriniva/Papers/crawlingFinal.pdf
A sequential web-crawler repeatedly:
1. consults the frontier to decide what pages to visit,
2. fetches those pages and informs the frontier with the response of each page,
3. updates the frontier with any new hyperlinks contained in those pages it has visited, and
4. visits those new webpages based on the policies of the crawler frontier.
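As a rough illustration of this loop, here is a minimal sequential crawler sketch in Java; the frontier is a plain FIFO queue rather than a real priority queue, links are extracted with a naive regex, politeness (robots.txt, per-host delays) is omitted, and the seed URL is a placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sequential crawler: a FIFO frontier, a fetcher, and a naive regex link extractor.
public class SequentialCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/")); // seeds
        Set<String> visited = new HashSet<>();
        HttpClient client = HttpClient.newHttpClient();

        while (!frontier.isEmpty() && visited.size() < 10) { // cap at 10 pages for the demo
            String url = frontier.poll();                    // 1. consult the frontier
            if (!visited.add(url)) continue;                 // skip already-visited pages
            try {
                HttpResponse<String> resp = client.send(     // 2. fetch the page
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
                Matcher m = LINK.matcher(resp.body());       // 3. extract new hyperlinks
                while (m.find()) frontier.add(m.group(1));   // 4. feed them back to the frontier
                System.out.println("Fetched " + url + " (" + resp.statusCode() + ")");
            } catch (Exception e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }
}
```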
Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
The fetcher is a multi-threaded application (i.e., it processes more than one task in parallel) that employs protocol plugins to retrieve the content of a set of URLs. Each protocol plugin implements one of the network protocols supported by the system (HTTP, FTP, etc.).
A plugin is a software component that adds a specific feature to an existing computer program. A network protocol is a set of rules, procedures and formats that defines communication between two or more devices over a network. Network protocols govern the end-to-end process of timely, secure and managed data communication.
Fetching a URL retrieves the page's raw content over the network (just as your browser does when you view the page).
Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
Parsing involves one or more of the following:
○ Building the HTML tag tree, which represents the page in the form of a tree structure.
○ Removing stopwords from the page's content and stemming the remaining words.
Stemming is the process of reducing inflected (and sometimes derived) words to their word stem, base or root form. E.g., the words consulting, consultant and consultative are all stemmed to consult.
Nutch: an Open-Source Platform for Web Search by Doug Cutting, 2005
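As a toy illustration of suffix-stripping stemming in Java (a handful of hard-coded suffixes, unlike the full Porter-style stemmers real systems use):

```java
import java.util.List;

// Toy suffix-stripping stemmer: maps "consulting", "consultant", "consultative" to "consult".
// Real stemmers (e.g. the Porter algorithm) have many more rules and special cases.
public class ToyStemmer {
    private static final List<String> SUFFIXES = List.of("ative", "ant", "ing", "ed", "s");

    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Only strip when a reasonably long stem remains.
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word; // no known suffix: return the word unchanged
    }

    public static void main(String[] args) {
        for (String w : List.of("consulting", "consultant", "consultative")) {
            System.out.println(w + " -> " + stem(w)); // all print "consult"
        }
    }
}
```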
The behaviour of a web-crawler depends on the outcome of a combination of policies:
○ A selection policy that states which pages to download.
○ A re-visit policy that states when to check pages for changes (the crawl frequency).
○ A parallelization policy that states how to coordinate a parallel crawler (a crawler that runs multiple processes).
Bamrah NHS, Satpute BS, Patil P., 2014. Web Forum Crawling Techniques. International Journal of Computer Applications. 85(36 – 41).
A crawling policy must therefore decide in what order URLs leave the queue, which pages end up represented in the index, and how frequently pages are re-crawled to capture updates.
Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval. 100-111
Breadth-First Search (BFS): all the parent links on a page are followed in sequential order before the crawler follows the child links. Because of this, the crawler needs to save all the parent links of a page in order to follow the child links; hence BFS consumes a lot of memory.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
Depth-First Search (DFS): the crawler picks one parent link and crawls its child links until it reaches the end, and then continues with another parent link. Since it only needs to remember the links on the current path in order to follow the child links, rather than every parent link on a page, it consumes relatively less memory than BFS.
https://www.theseus.fi/bitstream/handle/10024/93110/Panta_Deepak.pdf?sequence=1&isAllowed=y
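The practical difference between the two strategies is simply how the frontier hands out URLs: first-in-first-out gives BFS, last-in-first-out gives DFS. A small Java sketch over a made-up link graph:

```java
import java.util.*;

// BFS vs DFS crawl order on a tiny hypothetical link graph.
public class CrawlOrder {
    // page -> outlinks (made-up structure for illustration)
    static final Map<String, List<String>> LINKS = Map.of(
        "A", List.of("B", "C"),
        "B", List.of("D"),
        "C", List.of("E"),
        "D", List.of(), "E", List.of());

    static List<String> crawl(boolean bfs) {
        Deque<String> frontier = new ArrayDeque<>(List.of("A"));
        Set<String> seen = new HashSet<>(List.of("A"));
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty()) {
            // FIFO (BFS) finishes each level first; LIFO (DFS) dives down one path.
            String page = bfs ? frontier.pollFirst() : frontier.pollLast();
            order.add(page);
            for (String link : LINKS.get(page)) {
                if (seen.add(link)) frontier.addLast(link);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println("BFS: " + crawl(true));   // [A, B, C, D, E]
        System.out.println("DFS: " + crawl(false));  // [A, C, E, B, D]
    }
}
```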
Backlink-count ordering: the page with the highest number of backlinks from previously downloaded pages is downloaded next. A backlink is a link to our site from another website; an outlink is a link to another site from our site. Since the crawler cannot see the whole web, the importance of a page is estimated based on the pages and links acquired so far by the crawler. Popularity measures of this kind are also what search engines use to order pages in their search engine results.
Ostroumova L. et al., 2014. Crawling Policies Based on Web Page Popularity Prediction. European Conference on Information Retrieval 2014: Advances in Information Retrieval. 100-111
What is Apache Nutch?
Apache Nutch is a highly extensible and scalable open-source web crawler software project hosted by the Apache Software Foundation. It is designed as a flexible, scalable platform for the development of novel Web search engines. Nutch provides: a link-graph builder; schemas for indexing and search; distributed operation, for high scalability; and an extensible, plugin-based architecture.
Nutch is coded entirely in Java, but data is written in language-independent formats. Developers can create plug-ins for media-type parsing, data retrieval, querying and clustering.
Because Nutch works in batch cycles, there is no guarantee that a given page will be fetched / parsed / indexed within X minutes/hours. It nevertheless offers the following advantages over a simple fetcher:
○ A highly scalable and relatively feature-rich crawler
○ Politeness: it obeys robots.txt rules
○ Robustness: it can run on a cluster of machines
○ Quality: crawling can be biased to fetch 'important' pages first
Nutch comes in two branches: Nutch 1.x and Nutch 2.x.
Nutch History
In 2002, the Nutch project was started by Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella. The goal of the Apache Nutch project was to build a search engine system that could index 1 billion pages. Problem faced: such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive. They realized that their project architecture would not be capable of working with billions of pages on the web.
So they were looking for a feasible solution that could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
To meet the storage and processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities have since been spun out into their own subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.
The early concern that Nutch could not scale economically is no longer the case. IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5. The ClueWeb09 dataset (used e.g. in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.
Nutch 1.x vs Nutch 2.x
Nutch 1.x: a well-matured, production-ready crawler relying on Apache Hadoop data structures, which are great for batch processing.
Nutch 2.x: differs from 1.x in one key area: it abstracts the underlying data store by using Apache Gora, which provides a common model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.
Core Components
The Nutch search engine consists, very roughly, of three components:
1. The Crawler, which discovers and retrieves web pages
2. The WebDB, a custom database that stores known URLs and fetched page contents
3. The Indexer, which dissects pages and builds keyword-based indexes from them
NOTE: After the initial creation of an index, it is usual to perform periodic updates of the index in order to keep it up-to-date.
Apart from the above three components, Nutch has a Search Web Application, which is deployed in a servlet container.
https://cwiki.apache.org/confluence/display/NUTCH/Nutch+-+The+Java+Search+Engine
The components of a crawler are as follows:
○ Queue: in its simplest form, a list of the URLs left to fetch; usually it's more advanced and consists of host-based queues, a way to prioritize fetching of more important URLs, an ability to store parts or all of the data structures on disk, and so on.
○ Fetcher: downloads one unit of content at a time, for example one single HTML page.
○ Parser: extracts links and other information from an HTML page. The newly discovered URLs are then normalized and queued to be fetched, while the fetched content is processed and finally indexed.
Components of Nutch
CrawlDB: The Crawl Database is a data store where Nutch stores every URL, together with the metadata that it knows about it. Physically it is a Hadoop sequence file (meaning all records are stored in a sequential manner) consisting of tuples of URL and CrawlDatum. Many other data structures in Nutch are of similar structure, and no relational databases are used. The main reason behind this kind of data structure is scalability: the model of simple, flat data storage works well in a distributed environment.
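As a rough sketch of this flat storage model, the following writes <url, status> tuples with Hadoop's SequenceFile API; the Text status strings stand in for Nutch's real CrawlDatum values, and the output path is arbitrary.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes <url, status> tuples to a flat Hadoop SequenceFile, mimicking the
// CrawlDb layout (real Nutch stores a CrawlDatum, not a Text, as the value).
public class CrawlDbSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("crawldb/part-00000")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new Text("https://example.com/"), new Text("db_unfetched"));
            writer.append(new Text("https://example.org/"), new Text("db_fetched"));
        }
    }
}
```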
Components of Nutch
Segment: a directory containing all the data related to one fetching batch. Besides the Fetch List, the fetched content itself is stored there, in addition to the extracted plain-text version of the content, anchor texts and URLs of outlinks, protocol and document metadata, etc.
Fetch List: a data structure (SequenceFile, URL -> CrawlDatum) that contains the URL/CrawlDatum tuples that are going to be fetched in one batch. Its contents are a subset of the CrawlDB that was created by the generate command, and it is stored inside the Segment.
Components of Nutch
LinkDB: The Link Database is a data structure (SequenceFile, URL -> Inlinks) that contains all inverted links. Nutch can extract outlinks from a document and store them in the format source_url -> target_url, anchor_text. In the process of link inversion we invert the order and combine all instances, making the data records in the Link Database look like target_url -> anchor_text[], so we can use that information later when individual documents are indexed.
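A plain-Java sketch of that inversion, with made-up URLs and anchor texts:

```java
import java.util.*;

// Inverts outlink records (source -> target, anchor) into the LinkDb view:
// target -> list of (source, anchor) pairs.
public class LinkInversion {
    record Outlink(String source, String target, String anchor) {}

    public static void main(String[] args) {
        List<Outlink> outlinks = List.of(               // made-up parse output
            new Outlink("http://a.com", "http://c.com", "great site"),
            new Outlink("http://b.com", "http://c.com", "see also"),
            new Outlink("http://a.com", "http://b.com", "partner"));

        Map<String, List<String>> inlinks = new HashMap<>();
        for (Outlink o : outlinks) {
            inlinks.computeIfAbsent(o.target(), k -> new ArrayList<>())
                   .add(o.source() + " [\"" + o.anchor() + "\"]");
        }
        // http://c.com now lists both pages that link to it, with their anchor texts
        inlinks.forEach((target, in) -> System.out.println(target + " <- " + in));
    }
}
```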
Components of Nutch
Apache projects Nutch delegates work to:
○ Hadoop: distributed execution and data serialization (1.x)
○ Tika: parsing multiple document formats
○ Solr / Elasticsearch: making the crawled content searchable
○ Gora: data storage (2.x)
Important Terminologies
Apache Gora: A framework that provides an in-memory data model and also supports data persistence to different underlying storage systems: databases (like MySQL), column stores (like HBase and Cassandra), key-value stores (like Redis), and even simple files on HDFS.
Apache Tika: Detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Apache Solr: An open-source, REST-API-based enterprise real-time search and analytics engine server.
Important Terminologies
NDFS: The Nutch Distributed File System, later renamed HDFS.
CrawlDB: Stores all information about the URLs/documents that need to be crawled.
LinkDB: Stores the incoming links (URLs and anchor texts).
HBase: An open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
Crawling Algorithm in Apache Nutch:
1. INJECT URLs into the CrawlDb, to bootstrap it
2. GENERATE a set of URLs to fetch from the CrawlDb
3. FETCH a URL into a segment
4. PARSE the fetched content of a segment
5. UPDATE the CrawlDb from the data parsed from the segment
(Repeat steps 2 - 5)
6. INVERT links parsed from segments
7. INDEX segment text and inlink anchor text
MapReduce1: Convert input to CrawlDb format
Input: flat text file of URLs
Map(line) → <url, CrawlDatum>; status = db_unfetched
Reduce() is identity
Output: directory of temporary files

MapReduce2: Merge into existing CrawlDb
Input: output of MapReduce1 and existing CrawlDb files
Map() is identity
Reduce(): merge CrawlDatums into a single entry
Output: new version of the CrawlDb

Step 1: The Injector injects the list of seed URLs into the CrawlDb.
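A hedged Hadoop sketch of the MapReduce1 mapper described above; where real Nutch emits a CrawlDatum object, this simplified version emits the status string db_unfetched as Text:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Inject step, MapReduce1, simplified: each line of the flat seed file becomes
// a <url, status> record with status db_unfetched.
// (Nutch's real inject mapper emits a CrawlDatum, not a Text.)
public class InjectMapperSketch extends Mapper<LongWritable, Text, Text, Text> {

    private static final Text UNFETCHED = new Text("db_unfetched");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (!url.isEmpty() && !url.startsWith("#")) {   // skip blanks and comments
            context.write(new Text(url), UNFETCHED);    // <url, CrawlDatum-like status>
        }
    }
}
```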
MapReduce1: Select URLs due for fetch
Input: CrawlDb files
Map() → if date ≥ now, invert to <CrawlDatum, url>
Partition() by value, hash() to randomize
Reduce(): Compare() orders by decreasing CrawlDatum.linkCount
Output: the top-N most-linked entries (N being the max no. of URLs to fetch)

MapReduce2: Prepare for fetch
Input: output of MapReduce1
Map() is invert
Partition() by host
Reduce() is identity
Output: set of <url, CrawlDatum> files to fetch in parallel

Step 2: The Generator takes the list of URLs due for fetching from the CrawlDb, forms the fetch list, and adds a crawl_generate folder into the segments.
MapReduce: Fetch a set of URLs
Input: <url, CrawlDatum>, partitioned by host, sorted by hash
Map(url, CrawlDatum) → <url, FetcherOutput>
(FetcherOutput: <CrawlDatum, Content>)
The multi-threaded, async map implementation calls the existing Nutch protocol plugins
Reduce() is identity
Output: two files: <url, CrawlDatum> and <url, Content>
The fetcher is a multi-threaded application that employs protocol plugins to retrieve the content of a set of URLs.
Step 3: The fetch lists are used by the fetchers to fetch the raw content of the documents, which is then stored in the segments.
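A plain-Java sketch of the multi-threaded fetch idea, using a thread pool and HttpClient in place of Nutch's per-host fetch queues and protocol plugins; the fetch-list URLs are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.*;

// Fetches a batch of URLs in parallel, as Nutch's multi-threaded fetcher does
// (without its politeness queues and protocol plugins).
public class ParallelFetcher {
    public static void main(String[] args) throws Exception {
        List<String> fetchList = List.of(
            "https://example.com/", "https://example.org/");    // stand-in fetch list
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 fetcher threads

        List<Future<String>> results = pool.invokeAll(
            fetchList.stream().map(url -> (Callable<String>) () -> {
                HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
                return url + " -> " + resp.statusCode();
            }).toList());

        for (Future<String> r : results) System.out.println(r.get());
        pool.shutdown();
    }
}
```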
MapReduce: Parse content
Input: <url, Content> files from the fetch step
Map(url, Content) → <url, Parse>
(Parse: <ParseText, ParseData>)
Calls the existing Nutch parser plugins
Reduce() is identity
Output: split in three: <url, ParseText>, <url, ParseData>, and <url, CrawlDatum> for outlinks
Parser plugins are employed to extract text, links and other metadata from the raw binary content.
Step 4: The Parser is called to parse the content of the documents, and the parsed content is stored back in the segments.
MapReduce: Integrate fetch & parse output into the CrawlDb
Input: the existing CrawlDb plus the fetch & parse output (<url, CrawlDatum> entries)
Map() is identity
Reduce() merges all entries for a URL into a single new entry, summing the count of links from Parse with the previous links from the CrawlDb
Output: new CrawlDb
Step 5: Information about the newly fetched documents is updated in the CrawlDb.
MapReduce: Compute inlinks for all URLs
Input: <url, ParseData> containing page outlinks
Map(source_url, ParseData) → <destination_url, Inlinks>
(Inlinks: <source_url, AnchorText>*)
Collect a single-element Inlinks for each outlink; limit the number of outlinks per page
Reduce() appends inlinks
Output: <url, Inlinks>, a complete link inversion
* AnchorText: the visible, clickable text in an HTML hyperlink
Step 6: The links are inverted in the link graph and stored in the LinkDb.
MapReduce: Create indexes
Input: multiple files, values wrapped in <Class, Object>:
<url, ParseData> from Parse, for title, metadata, etc.
<url, ParseText> from Parse, for text
<url, Inlinks> from Invert, for anchors
<url, CrawlDatum> from Fetch, for the fetch date
Map() is identity
Reduce() creates a Lucene Document and calls the existing Nutch indexing plugins
Output: a Lucene index, copied to the file system at the end
* Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website.
Step 7: The terms present in the segments are indexed, and the indices are updated.
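A minimal Lucene sketch of what the reduce side produces: a Document assembled from the parsed fields and added to an on-disk index. The field names used here are illustrative, not Nutch's actual index schema.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// Builds one Lucene Document per crawled page and writes it to an on-disk index.
// Field names ("url", "title", "content", "anchors") are illustrative only.
public class IndexerSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("url", "https://example.com/", Field.Store.YES));
            doc.add(new TextField("title", "Example Domain", Field.Store.YES));
            doc.add(new TextField("content", "parsed plain text of the page", Field.Store.NO));
            doc.add(new TextField("anchors", "example site", Field.Store.NO)); // from LinkDb
            writer.addDocument(doc);
        }
    }
}
```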
REST means REpresentational State Transfer. REST is a web-standards-based architecture that uses the HTTP protocol. It revolves around resources, where every component is a resource accessed through a common interface using HTTP standard methods. A REST architecture consists of a REST server, which provides access to the resources, and a REST client, which accesses and modifies the REST resources. Resources can be represented in various formats, such as XML, JSON etc. Because REST services are stateless, they can scale to accommodate load changes.
Architecture
Source: https://www.slideshare.net/AniruddhBhilvare/an-introduction-to-rest-api
HTTP Methods
The PUT, GET, POST and DELETE methods are typically used in REST-based architectures:
○ GET: read/retrieve a resource
○ POST: create a new resource
○ PUT: update or replace an existing resource
○ DELETE: remove a resource
Refer to - https://cwiki.apache.org/confluence/display/NUTCH/NutchRESTAPI
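For example, after starting the server with bin/nutch startserver (which listens on port 8081 by default), a client can call the API with plain HTTP; the /admin status endpoint used below is taken from the wiki page referenced above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Queries a locally running Nutch REST server for its status.
// Assumes the server was started with: bin/nutch startserver (default port 8081).
public class NutchRestClient {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8081/admin"))  // status endpoint from the wiki
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());   // status of the Nutch server
    }
}
```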
All components listed above use the Nutch API. Users can utilize the API via two approaches, depending on the task at hand:
1. Through the Nutch shell script, for administrative tasks such as creating and maintaining indexes
2. Through the Search Web Application, in order to perform a search using keywords
Refer to - https://www.youtube.com/watch?v=AvyBiGuBc64
https://www.youtube.com/embed/AvyBiGuBc64?start=1020&end=1086
https://www.youtube.com/embed/AvyBiGuBc64?start=1092&end=1152
○ bin/nutch inject urls (inject the seed URLs from the urls directory into the CrawlDb)
○ bin/nutch generate -topN 1000 (will give us the top 1000 URLs with the most inlinks)
○ bin/nutch fetch -all (will fetch all the URLs generated in the previous step)
○ bin/nutch updatedb -all (update the CrawlDb with the newly fetched URLs)
○ Finally, index the fetched content using Elasticsearch or Solr, etc.
Nutch is a great web-search system because:
○ Mature, business-friendly licence and community
○ Tried and tested on a very large scale
○ Hadoop cluster: existing installation and skills can be reused
○ Can index with Solr, Elasticsearch, etc.
○ Has a PageRank implementation
○ Can be extended with plugins
However, there are a few issues in Nutch, in both 1.x (restarting in case of a system crash) and 2.x (unstable), that need to be worked on.