Frontera: Large-Scale Open Source Web Crawling Framework
Alexander Sibiryakov, 20 July 2015 sibiryakov@scrapinghub.com
Hello, participants!

About me:
- Born in Yekaterinburg, RU.
- 5 years at Yandex, search quality department: social and QA search, snippets.
- Research team: automatic false positive solving, large-scale prediction of malicious download attempts.
«A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.»
– Wikipedia: Web Crawler article, July 2015
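The definition above can be sketched as a minimal crawl loop; `fetch_and_extract` is a hypothetical stand-in for real downloading and link extraction:

```python
from collections import deque

def crawl(seeds, fetch_and_extract, max_pages=100):
    """Minimal crawl loop: the 'frontier' is the queue of URLs still to visit."""
    frontier = deque(seeds)      # starts as the list of seed URLs
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_and_extract(url):   # hyperlinks found on the page
            if link not in seen:              # new URLs join the frontier
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph instead of real HTTP fetching:
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
assert crawl(["a"], lambda u: graph.get(u, [])) == ["a", "b", "c"]
```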
The task: sustain a target crawl rate of pages/week, and identify frequently changing HUB pages.
Scrapy was designed for vertical crawling and had no crawl frontier capabilities out of the box; the alternative considered was Apache Nutch instead of Scrapy.
HUB pages: Hyperlink-Induced Topic Search (HITS), Jon Kleinberg, 1999.
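As a reminder of how HITS scores hubs: each page gets a hub score and an authority score, refined iteratively. A toy power-iteration sketch (illustrative only, not Frontera code):

```python
def hits(links, iterations=20):
    """Toy HITS: links maps page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to it
        auth = {p: sum(hub[q] for q in links if p in links.get(q, [])) for p in pages}
        # hub score: sum of authority scores of pages it links to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # normalize so scores stay bounded
        h_norm = sum(hub.values()) or 1.0
        a_norm = sum(auth.values()) or 1.0
        hub = {p: v / h_norm for p, v in hub.items()}
        auth = {p: v / a_norm for p, v in auth.items()}
    return hub, auth

links = {"hub1": ["x", "y", "z"], "other": ["x"]}
hub, auth = hits(links)
assert hub["hub1"] > hub["other"]  # hub1 points at more authorities
```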
Requirements:
- the crawl frontier decides what to crawl next and when to stop,
- many websites (parallel downloading),
- distributed operation mode.
Why Frontera:
- online operation: scheduling of new batches, updating of DB state,
- storage abstraction: write your own backend (sqlalchemy, HBase is included),
- canonical URLs abstraction: a document has many URLs, which to use?
- Python: community, ease of customization.
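A sketch of what canonical URL resolution addresses: many URL strings can point at the same document, so the crawler needs one normalized form. A simplified normalizer (illustrative, not Frontera's actual resolver):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Normalize a URL so duplicates of the same document compare equal:
    lowercase scheme and host, drop the fragment, default to '/' path."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

# '#top' and upper-case host are cosmetic: same document, same canonical URL
assert canonicalize("HTTP://Example.es#top") == canonicalize("http://example.es/")
```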
The frontier is fed with links extracted from the spider, and the backend decides the crawling order (e.g., revisiting).
Frontera is implemented as a set of custom scheduler and spider middleware for Scrapy. Its main components are loosely coupled with Scrapy, and can be used separately. This design separates crawl frontier management and fetching.
Quickstart:
- clone the repo,
- add Frontera's spider middleware,
- configure a backend (to track changes).
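A sketch of what the Scrapy settings change could look like; the middleware and scheduler paths below reflect the Frontera docs of that era and may differ by version, so verify against the current documentation:

```python
# settings.py of your Scrapy project (sketch; module paths may vary by version)

SPIDER_MIDDLEWARES = {
    # Hands extracted links over to Frontera instead of Scrapy's own queue
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
# Replace Scrapy's scheduler: requests now come from the crawl frontier
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
# Module with Frontera-specific settings (backend choice, etc.) -- name is hypothetical
FRONTERA_SETTINGS = 'myproject.frontera.settings'
```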
Use case: a crawl of the Spanish Web.
- gather statistics: structure of the graph, tracking domain count, etc., to learn more about that topic,
- find pages that are big hubs, and frequently changing in time.
(Architecture diagram: spiders exchange messages with strategy workers (SW) and DB workers through Kafka topics.)
Why Kafka: partitioning and the offsets mechanism, with a ready-made Python client module. Each partition is consumed by at most one spider.
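A sketch of how hostname-based partitioning can assign URLs to spider partitions (illustrative, not Frontera's exact partitioner): hashing the hostname keeps all URLs of one host in one partition, hence fetched by at most one spider, which preserves per-host politeness.

```python
from urllib.parse import urlparse
from zlib import crc32

N_PARTITIONS = 2  # number of spider partitions; value chosen for illustration

def partition_for(url, n_partitions=N_PARTITIONS):
    """Map a URL to a partition by hashing its hostname.

    All URLs of one host land in the same partition, so that host is
    fetched by at most one spider."""
    host = urlparse(url).hostname or ""
    return crc32(host.encode("utf-8")) % n_partitions

# Same host -> same partition, regardless of path:
assert partition_for("http://example.es/a") == partition_for("http://example.es/b")
```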
Hardware and lessons learned:
- CDH (100% open source Hadoop distribution).
- Fetching from about 100 websites in parallel.
- Problem: bandwidth for internal communication. Solution: increase the count of network interfaces.
- HDDs are sufficient, and free RAM would be great for caching the priority queue.
- If you see a performance issue, make sure that the Kafka brokers have enough IOPS.
Consult http://distributed-frontera.readthedocs.org/ for more information.
Results: equiposdefutbol2014.es, druni.es and docentesconeducacion.es are the biggest websites by number of pages. For more info and graphs, check the poster.
Future plans:
- crawling strategy,
- parsing,
- paid services.
Thank you! Alexander Sibiryakov, sibiryakov@scrapinghub.com