SLIDE 1

Frontera: open source, large scale web crawling framework

Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com

SLIDE 2
  • Born in Yekaterinburg, RU.
  • 5 years at Yandex, search quality department: social and QA search, snippets.
  • 2 years at Avast! antivirus, research team: automatic false positive resolution, large-scale prediction of malicious download attempts.

Sziasztok résztvevők! (Hello, participants!)

SLIDE 3

Task

  • Crawl the Spanish web to gather statistics about hosts and their sizes.
  • Limit the crawl to the .es zone.
  • Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on.
  • Finishing condition: no hosts remain with fewer than 100 crawled documents.
  • Low cost.

SLIDE 4

Spanish internet (.es) in 2012

  • Domain names registered - 1.56M (39% growth per year)
  • Web servers in the zone - 283.4K (33.1%)
  • Hosts - 4.2M (21%)
  • Spanish web sites in the DMOZ catalog - 22,043

* Source: OECD Communications Outlook 2013 report

SLIDE 5

Solution

  • Scrapy* - network operations.
  • Apache Kafka - data bus (offsets, partitioning).
  • Apache HBase - storage (random access, linear scanning, scalability).
  • Twisted.Internet - library of async primitives used in the workers.
  • Snappy - efficient compression algorithm for IO-bound applications.

* Network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet.

SLIDE 6

Architecture

[Architecture diagram: Kafka topics connecting the crawling strategy workers (SW) and the storage workers (DB)]

SLIDE 7
1. Big and small hosts problem

  • When the crawler receives a huge number of links from a single host and a simple prioritization model is used, the queue ends up flooded with URLs from that host.
  • That causes underuse of spider resources.
  • We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory (a minimal sketch follows below).
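
A minimal sketch of the per-host metering idea, assuming invented names (PerHostQueue, max_per_batch) and plain in-memory structures; Frontera's real queue is backed by HBase, so this only illustrates the batching rule:

    from collections import defaultdict, deque

    class PerHostQueue:
        """Illustrative per-host queue: each host contributes at most
        max_per_batch URLs to a batch; the rest stay cached in memory."""

        def __init__(self, max_per_batch=10):
            self.max_per_batch = max_per_batch
            self.queues = defaultdict(deque)   # hostname -> pending URLs

        def push(self, hostname, url):
            self.queues[hostname].append(url)

        def next_batch(self, batch_size=100):
            batch = []
            for urls in self.queues.values():
                taken = 0
                while urls and taken < self.max_per_batch and len(batch) < batch_size:
                    batch.append(urls.popleft())
                    taken += 1
                if len(batch) >= batch_size:
                    break
            return batch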

SLIDE 8
3. DDoS of the Amazon AWS DNS service

  • The breadth-first strategy implies visiting previously unknown hosts first, which generates a huge number of DNS requests.
  • Solution: a recursive DNS server on each downloading node, with upstreams set to Verizon and OpenDNS.
  • We used dnsmasq (a sample configuration sketch follows).
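
A minimal dnsmasq.conf sketch for such a downloading node; the upstream IPs are placeholders (the deck only names Verizon and OpenDNS), and the cache size is an arbitrary example:

    # /etc/dnsmasq.conf -- illustrative values only
    no-resolv                 # ignore /etc/resolv.conf, use the servers below
    server=203.0.113.53       # placeholder for the first upstream resolver
    server=208.67.222.222     # OpenDNS upstream
    cache-size=10000          # enlarge the local cache for crawl workloads
    listen-address=127.0.0.1  # serve only the local node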

SLIDE 9
4. Tuning the Scrapy thread pool for efficient DNS resolution

  • Scrapy uses a thread pool to resolve DNS names to IPs.
  • When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which blocks.
  • Scrapy reported numerous errors related to DNS name resolution and timeouts.
  • We added an option to Scrapy for adjusting the thread pool size and timeout (see the settings example below).
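
In later Scrapy releases these knobs are exposed as the settings shown below; the values are arbitrary examples, not recommendations:

    # settings.py -- illustrative values
    REACTOR_THREADPOOL_MAXSIZE = 20   # threads available for blocking DNS lookups
    DNS_TIMEOUT = 60                  # seconds to wait for a DNS response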

SLIDE 10
5. Overloaded HBase region servers during state check

  • The crawler extracts hundreds of links per document on average.
  • Before adding these links to the queue, they need to be checked against the already-crawled state (to avoid repeated visits).
  • On small volumes SSDs were just fine; after the table grew we had to move to HDDs, and response times went up dramatically.
  • Solution: a host-local fingerprint function for keys in HBase (sketched after this list).
  • Tuning the HBase block cache to fit the average host's states into one block.
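
The intent of the host-local key is that all keys of one host form a contiguous range in HBase's sorted key space, so a state check for a host touches few blocks. A rough sketch of such a key function (a hypothetical helper, not the exact Frontera code):

    import hashlib
    from urllib.parse import urlparse
    from zlib import crc32

    def host_local_fingerprint(url):
        """4-byte host prefix + SHA-1 of the full URL keeps URLs of the
        same host adjacent in a lexicographically sorted HBase table."""
        host = urlparse(url).netloc.encode("utf-8")
        host_prefix = crc32(host).to_bytes(4, "big")
        return host_prefix + hashlib.sha1(url.encode("utf-8")).digest()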

SLIDE 11
6. Intensive network traffic from workers to services

  • We observed up to 1 Gbit/s of traffic between the workers, Kafka and HBase.
  • Switched to the Thrift compact protocol for HBase communication.
  • Enabled message compression in Kafka using Snappy (see the producer example below).
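
For illustration, this is how Snappy compression is switched on with the present-day kafka-python client (not necessarily the client used in 2015); the broker address and topic name are placeholders:

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # placeholder broker address
        compression_type="snappy",           # compress message batches with Snappy
    )
    producer.send("frontier-spider-log", b"serialized crawl event")  # hypothetical topic
    producer.flush()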

SLIDE 12
7. Further query and traffic optimizations to HBase

  • The state check accounted for the lion's share of requests and network throughput.
  • Consistency was another requirement.
  • We created a local state cache in the strategy worker.
  • For consistency, the spider log was partitioned by host, to avoid cache overlap between workers.

SLIDE 13

State cache

  • All operations are batched:
  • if a key is absent from the cache, it is requested from HBase,
  • every ~4K documents the cache is flushed to HBase.
  • When the cache reaches 3M elements (~1 GB), a flush and cleanup happen.
  • A Least-Recently-Used (LRU) eviction policy seems a good fit here (a sketch follows).
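
A minimal sketch of such a batched, LRU-bounded state cache; the class and method names and the backend object with get_states()/put_states() are invented for illustration, while the thresholds mirror the slide:

    from collections import OrderedDict

    class StateCache:
        """LRU-bounded local cache of per-URL states with batched HBase flushes."""

        def __init__(self, backend, flush_every=4000, max_size=3_000_000):
            self.backend = backend        # hypothetical object with get_states()/put_states()
            self.flush_every = flush_every
            self.max_size = max_size
            self.cache = OrderedDict()    # fingerprint -> state, kept in LRU order
            self.dirty = {}               # writes not yet flushed to HBase
            self.ops_since_flush = 0

        def get(self, fingerprint):
            if fingerprint not in self.cache:
                self.cache[fingerprint] = self.backend.get_states([fingerprint])[0]
            self.cache.move_to_end(fingerprint)           # mark as recently used
            return self.cache[fingerprint]

        def set(self, fingerprint, state):
            self.cache[fingerprint] = state
            self.cache.move_to_end(fingerprint)
            self.dirty[fingerprint] = state
            self.ops_since_flush += 1
            if self.ops_since_flush >= self.flush_every:  # flush every ~4K documents
                self.flush()
            while len(self.cache) > self.max_size:        # evict least recently used
                self.cache.popitem(last=False)

        def flush(self):
            if self.dirty:
                self.backend.put_states(self.dirty)
                self.dirty.clear()
            self.ops_since_flush = 0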

SLIDE 14

Spider priority queue (slot)

  • A cell holds an array of: fingerprint, crc32(hostname), URL, score.
  • Dequeueing takes the top N cells.
  • Such a design is prone to huge hosts.
  • This can be partially solved with a scoring model that takes into account the known document count per host (a hypothetical damping formula is sketched below).
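
Purely as an illustration of the last point (not Frontera's actual formula), a scoring model can damp a URL's priority by the number of documents already known for its host:

    import math

    def adjusted_score(base_score, docs_known_for_host):
        """The more documents a host already has queued or crawled,
        the lower the priority of its remaining URLs."""
        return base_score / math.log2(2 + docs_known_for_host)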

SLIDE 15
8. Problem of big and small hosts (strikes back!)

  • During the crawl we found a few very large hosts (>20M docs).
  • All queue partitions were flooded with pages from these few huge hosts, because of the queue design and the scoring model used.
  • We made two MapReduce jobs:
  • queue shuffling,
  • limiting every host to no more than 100 documents.

SLIDE 16
Hardware requirements

  • A single-threaded Scrapy spider gives 1200 pages/min from about 100 websites in parallel.
  • Spiders to workers ratio is 4:1 (without content).
  • 1 GB of RAM for every SW (state cache, tunable).
  • Example:
  • 12 spiders ~ 14.4K pages/min,
  • 3 SW and 3 DB workers,
  • 18 cores total.

SLIDE 17
Software requirements

  • Apache HBase,
  • Apache Kafka,
  • Python 2.7+,
  • Scrapy 0.24+,
  • DNS service.

CDH (100% open source Hadoop package)

SLIDE 18

Maintaining Cloudera Hadoop on Amazon EC2

  • CDH is very sensitive to free space on the root partition: parcels and Cloudera Manager storage.
  • We moved them to a separate EBS partition using symbolic links.
  • The EBS volume should be at least 30 GB; base IOPS are enough.
  • Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB RAM, 2x40 GB SSD).
  • After one week of crawling we ran out of space and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3x2 TB HDD).

SLIDE 19

Spanish (.es) internet crawl results

  • fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
  • 68.7K domains found (~600K expected),
  • 46.5M pages crawled overall,
  • 1.5 months,
  • 22 websites with more than 50M pages.

SLIDE 20

Where are the rest of the web servers?!

SLIDE 21

Bow-tie model

  • A. Broder et al. / Computer Networks 33 (2000) 309-320
SLIDE 22
  • Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005
SLIDE 23

Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014

SLIDE 24
Main features

  • Online operation: scheduling of new batches, updating of DB state.
  • Storage abstraction: write your own backend (an SQLAlchemy backend and HBase are included; a toy backend sketch follows this list).
  • Canonical URL resolution abstraction: each document has many URLs; which one should be used?
  • Scrapy ecosystem: good documentation, big community, ease of customization.
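
To illustrate the storage abstraction, here is a toy backend sketch. The method names only approximate the Frontera backend interface of that era; the authoritative API is in the documentation at frontera.readthedocs.org.

    class ToyMemoryBackend:
        """Toy queue-and-state backend; real deployments plug in SQLAlchemy or HBase."""

        def __init__(self):
            self.queue = []    # pending requests
            self.seen = set()  # URLs already scheduled

        def add_seeds(self, seeds):
            for request in seeds:
                self._schedule(request)

        def page_crawled(self, response, links):
            for link in links:
                self._schedule(link)

        def get_next_requests(self, max_n_requests):
            batch, self.queue = self.queue[:max_n_requests], self.queue[max_n_requests:]
            return batch

        def _schedule(self, request):
            if request.url not in self.seen:
                self.seen.add(request.url)
                self.queue.append(request)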

SLIDE 25
Main features

  • Communication layer is Apache Kafka: topic partitioning, offsets mechanism.
  • Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
  • Polite by design: each website is downloaded by at most one spider.
  • Python: workers, spiders.

SLIDE 26

References

  • Distributed Frontera: https://github.com/scrapinghub/distributed-frontera
  • Frontera: https://github.com/scrapinghub/frontera
  • Documentation:
  • http://distributed-frontera.readthedocs.org/
  • http://frontera.readthedocs.org/

SLIDE 27

Future plans

  • A lighter version, without HBase and Kafka, communicating over sockets.
  • A revisiting strategy out of the box.
  • A watchdog solution: tracking website content changes.
  • A PageRank or HITS strategy.
  • Our own HTML and URL parsers.
  • Integration into Scrapinghub services.
  • Testing on larger volumes.

SLIDE 28

Contribute!

  • Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python.
  • It is a truly resource-intensive task: CPU, network, disks.
  • Made at Scrapinghub, the company where Scrapy was created.
  • There are plans to become an Apache Software Foundation project.

SLIDE 29

We’re hiring!

http://scrapinghub.com/jobs/

SLIDE 30

Köszönöm! (Thank you!)

Alexander Sibiryakov, sibiryakov@scrapinghub.com