Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com
Hello, participants! (Hungarian: "Sziasztok, résztvevők!")

About me:
- Born in Yekaterinburg, RU.
- 5 years at Yandex, search quality department: social and QA search, snippets.
- Research team: automatic resolution of false positives, large-scale prediction of malicious download attempts.
The task:
- gather statistics about hosts and their sizes;
- crawl documents at 1-click distance first, then 2 clicks, and so on;
- prioritize hosts with less than 100 crawled documents.
…per year).
* OECD Communications Outlook 2013 report
…scalability).
…applications.
* Network operations in Scrapy are implemented asynchronously, based on the same twisted.internet reactor.
Architecture: spiders, crawling strategy workers (SW) and storage (DB) workers, connected through Kafka topics.
Problem: with a large number of links from some hosts, combined with simple prioritization models, the queue gets flooded with URLs from the same host, leading to inefficient use of spider resources.
Solution: a per-host (optionally per-IP) queue and a metering algorithm; URLs from big hosts are cached in memory.
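The per-host queue with metering can be sketched as follows (illustrative only, not Frontera's actual implementation; class and parameter names are invented):

```python
# Illustrative sketch: each host gets its own FIFO queue, and a round-robin
# scheduler hands out at most `per_host_limit` URLs per host per batch,
# so a single big host cannot flood the global queue.
from collections import defaultdict, deque
from urllib.parse import urlparse

class PerHostQueue:
    def __init__(self, per_host_limit=10):
        self.per_host_limit = per_host_limit
        self.queues = defaultdict(deque)   # host -> FIFO of URLs

    def push(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_batch(self):
        # Take up to per_host_limit URLs from every host, in turn.
        batch = []
        for host, q in list(self.queues.items()):
            for _ in range(min(self.per_host_limit, len(q))):
                batch.append(q.popleft())
            if not q:
                del self.queues[host]
        return batch
```

With `per_host_limit=2`, pushing five URLs from one host and one from another yields a batch with at most two URLs from the big host, regardless of its size.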
Problem: first visits to previously unknown hosts generate a huge amount of DNS requests.
Solution: a local DNS caching recursive resolver on every downloading node, with upstream set to Verizon and OpenDNS.
Scrapy resolves DNS names to IPs using a thread pool: each request to the DNS server is sent in its own thread, which is blocking.
We observed errors related to DNS name resolution and timeouts, and prepared a patch for thread pool size and timeout adjustment.
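In later Scrapy releases these knobs are exposed as settings; a sketch of the relevant `settings.py` lines (values are illustrative, not recommendations):

```python
# settings.py sketch -- illustrative values.
# REACTOR_THREADPOOL_MAXSIZE sizes the Twisted thread pool that, among other
# things, performs the blocking DNS lookups; DNS_TIMEOUT bounds each lookup.
REACTOR_THREADPOOL_MAXSIZE = 20
DNS_TIMEOUT = 10  # seconds
```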
Every downloaded document contains hundreds of links on average, and each link needs to be checked against already-crawled content (to avoid repetitive visiting).
After the table size increased, we had to move to HDDs, and response times grew dramatically.
Solution: a key design in HBase that packs the states of an average host into one block.
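One way to get such key locality is to prefix each URL's row key with a hash of its host, so all states of one host land in a contiguous key range. A sketch (an assumption about the scheme, not Frontera's exact key layout):

```python
# Illustrative row-key scheme: a short host fingerprint prefix keeps all
# states of one host adjacent in HBase, so an average host's states can
# be read from a single block.
from hashlib import sha1
from urllib.parse import urlparse

def row_key(url: str) -> bytes:
    host = urlparse(url).netloc
    host_fp = sha1(host.encode()).digest()[:4]   # 4-byte host fingerprint
    url_fp = sha1(url.encode()).digest()         # 20-byte URL fingerprint
    return host_fp + url_fp
```

Two URLs from the same host share the same 4-byte prefix and therefore sort next to each other.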
Problem: traffic between workers, Kafka and HBase reached up to 1 Gbit/s.
Solution: the Thrift compact protocol for HBase communication, and message compression in Kafka using Snappy.
Problem: state checks consumed a significant share of requests and network throughput, while fast responses were a requirement.
Solution: a local state cache in the strategy worker. Data was partitioned by host, to avoid cache overlap between workers.
Cache operation:
- a state absent from the cache is requested from HBase;
- when the cache grows too big, it is flushed to HBase;
- after reaching a fixed number of elements, flush and cleanup happen;
- a Least Recently Used (LRU) eviction algorithm is a good fit there.
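The steps above can be sketched as a small LRU-evicting cache (names and the fetch/flush callbacks are illustrative, not Frontera's API):

```python
# Minimal state-cache sketch with LRU eviction: misses are fetched (e.g. a
# batched HBase GET), and when the cache overflows it is flushed (e.g. a
# batched HBase PUT) and the least recently used half is evicted.
from collections import OrderedDict

class StateCache:
    def __init__(self, max_size, fetch, flush):
        self.max_size = max_size
        self.fetch = fetch          # callback: fingerprint -> state
        self.flush = flush          # callback: dict of states -> storage
        self._items = OrderedDict()

    def get(self, fingerprint):
        if fingerprint not in self._items:
            self._items[fingerprint] = self.fetch(fingerprint)
        self._items.move_to_end(fingerprint)      # mark as recently used
        if len(self._items) > self.max_size:
            self.flush(dict(self._items))         # persist before eviction
            while len(self._items) > self.max_size // 2:
                self._items.popitem(last=False)   # evict the LRU entries
        return self._items[fingerprint]
```

Evicting down to half capacity on overflow avoids flushing on every single insert once the cache is full.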
…hosts. Solved using a scoring model taking into account the known document count per host.

Problem: very huge hosts (>20M docs). The queue got flooded with pages from a few huge hosts, because of the queue design and the scoring model used.
Solution: the crawl was split into two jobs: one for hosts with less than 100 documents, and one for the rest.
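The idea of a document-count-aware score can be sketched like this (a hypothetical formula for illustration, not the model actually used):

```python
# Illustrative scoring sketch: down-weight URLs from hosts where many
# documents are already known, so small hosts are preferred -- matching
# the goal of covering hosts with fewer than 100 crawled documents.
def score(base_score: float, known_docs_on_host: int) -> float:
    # Halve the score once ~100 documents of the host are known,
    # and keep decaying beyond that.
    return base_score / (1.0 + known_docs_on_host / 100.0)
```

Under this toy formula an unknown host keeps its base score, while a host with millions of known documents scores close to zero.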
Each spider downloads from about 100 websites in parallel.
Deployment: CDH (100% open source Hadoop package).
…storage of Cloudera Manager.
We had to move DataNodes to d2.xlarge instances (4 CPUs, 30.5 GB RAM, 3×2 TB HDD).
Results: equiposdefutbol2014.es, druni.es and docentesconeducacion.es are the biggest websites (…expected);
50M pages.
Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014
Frontera features:
- …updating of DB state;
- pluggable backends (SQLAlchemy; HBase is included);
- canonical URLs: a document can have many URLs, which one to use?
- Python: community, ease of customization.
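To illustrate the canonical-URL question above, here is a toy heuristic for choosing one representative URL among several that refer to the same document (illustrative only; Frontera's canonical URL solver may work differently):

```python
# Illustrative heuristic: prefer redirect targets over sources, and among
# the remaining candidates pick the shortest URL as the canonical one.
def canonical_url(urls, redirects=None):
    redirects = redirects or {}                  # source URL -> redirect target
    targets = [redirects.get(u, u) for u in urls]
    return min(targets, key=len)                 # shortest candidate wins
```

For example, `http://example.com/?ref=rss` and `http://example.com/` collapse to the latter.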
distributed-frontera:
- built on Kafka: partitioning, offsets mechanism;
- …module;
- every host is downloaded by at most one spider.
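The "at most one spider per host" property follows from partitioning by host hash: every URL of a host maps to the same partition, and each partition is consumed by one spider. A sketch (illustrative, not distributed-frontera's exact code):

```python
# Illustrative host-based partitioner: hashing the host name sends all URLs
# of one host to the same partition, so a single spider owns that host.
from hashlib import md5
from urllib.parse import urlparse

def partition(url: str, n_partitions: int) -> int:
    host = urlparse(url).netloc
    digest = md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_partitions
```

Two URLs from `example.com` always land in the same partition, while different hosts spread across all partitions.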
scrapinghub/distributed-frontera
Future plans:
- …and Kafka, communicating using sockets;
- a revisiting strategy based on website content changes;
- …services.
Frontera is historically the first attempt to implement a web-scale web crawler using Python.
…CPU, network, disks.
Scrapinghub: the company where Scrapy was created.
…Software Foundation project.
http://scrapinghub.com/jobs/
Thank you! Alexander Sibiryakov, sibiryakov@scrapinghub.com