SLIDE 1

Crawler.NET

A component-based distributed framework for web traversal

Crawler.NET: A component-based distributed framework for web traversal

Levente Hunyadi (BME AAIT)

March 23, 2007

SLIDE 2

Motivation

The Web:

  • a source of distributed information
  • a giant set of semi-structured data

⇒ search engines are invaluable for locating information

SLIDE 3

Motivation

  • up-to-date index database

  • efficient traversal

  • parallelization

  • distributed architecture

  • increased complexity
SLIDE 4

Objectives

  • scalability
  • easy configuration and management
  • support for extension
  • robustness, resilience to failures
SLIDE 5

Architectural overview

Two separate layers:

Component framework (general tasks)

  • component interaction
  • lifecycle management
  • transparent interprocess communication

Crawling application (field-specific issues)

  • downloading documents
  • extracting hyperlinks
  • administering page references
  • scheduling requests
SLIDE 6

Design

  • the component framework exposes general component skeletons that realize common behavior
  • new, field-specific components are created by means of inheritance
  • the framework provides loose coupling between components

Advantages:

  + simpler and faster development
  + openness for extension
SLIDE 7

Building blocks of the architecture

  • Components: encapsulate field-specific functionality; produce, consume or transform data
  • Providers: give access to data sources
  • Connectors: provide asynchronous, message-based communication between components

SLIDE 8

Components

  • abstract base class implements generic tasks
  • differentiated subclasses based on how they interact with the environment
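The skeleton-plus-inheritance idea can be sketched roughly as follows. This is a minimal Python illustration, not the framework's actual API: the class names GenericComponent and SimpleFilter echo names that appear in this deck, but the methods (start, stop, transform, process) and the LowercaseHostFilter subclass are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List


class GenericComponent(ABC):
    """Base skeleton: owns the lifecycle so subclasses only supply behavior."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


class SimpleFilter(GenericComponent):
    """Skeleton for a component that turns one input item into output items."""

    @abstractmethod
    def transform(self, item: Any) -> Iterable[Any]:
        """Field-specific step, supplied by the subclass."""

    def process(self, items: Iterable[Any]) -> List[Any]:
        # Generic part: lifecycle check and iteration live in the skeleton.
        if not self.running:
            raise RuntimeError(f"{self.name} is not started")
        out: List[Any] = []
        for item in items:
            out.extend(self.transform(item))
        return out


class LowercaseHostFilter(SimpleFilter):
    """Hypothetical field-specific component: normalizes host names."""

    def transform(self, item: str) -> Iterable[str]:
        return [item.lower()]
```

The subclass contains only the field-specific step; the lifecycle check and the loop are inherited from the skeleton.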

SLIDE 9

Components

[Class diagram of component types: GenericComponent, GenericProducer, GenericConsumer, SimpleFilter, ComplexFilter, AsynchronousComplexFilter, SynchronousOutputFilter, SemiSynchronousComplexFilter, SynchronousComplexFilter]

SLIDE 10

Providers

  • wrap external resources used by components
  • synchronized access to data sources
  • diverse functionality:
      • access databases
      • transparent cache mechanisms
      • network resources
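A provider that combines synchronized access with a transparent cache might look roughly like this. It is a sketch under assumptions: the CachingProvider name, its get method and the load callback are invented for illustration and are not taken from the framework.

```python
import threading
from typing import Callable, Dict


class CachingProvider:
    """Wraps a slow data source behind a lock and an in-memory cache."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load                  # hypothetical accessor for the wrapped resource
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()
        self.misses = 0                    # number of times the resource was actually hit

    def get(self, key: str) -> str:
        # The lock serializes access, so components on different threads
        # can share one provider instance.
        with self._lock:
            if key not in self._cache:
                self.misses += 1
                self._cache[key] = self._load(key)
            return self._cache[key]
```

Components call get and never see whether the value came from the cache or from the underlying resource.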
SLIDE 11

Connectors

  • abstractions of typed queues
  • represent a message queue
  • intra-process or inter-process
  • support one-to-many, many-to-many relationships, identification by roles

SLIDE 12

Realization of connectors

Method of message transfer transparent to components:

  • local connector: typed FIFO queue; data is passed by reference
  • remote connector: corresponds to two local queues and associated network communication components in separate processes; data is serialized (and transmitted over TCP)
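The difference between the two connector kinds can be illustrated with a small Python sketch. The class names and the send/receive methods are assumptions; the second class only round-trips the message through pickle to stand in for serialization, and does not open a TCP connection.

```python
import pickle
import queue
from typing import Any


class LocalConnector:
    """Typed FIFO queue inside one process; items are passed by reference."""

    def __init__(self, item_type: type) -> None:
        self._item_type = item_type
        self._queue: "queue.Queue[Any]" = queue.Queue()

    def _check(self, item: Any) -> None:
        if not isinstance(item, self._item_type):
            raise TypeError(f"expected {self._item_type.__name__}")

    def send(self, item: Any) -> None:
        self._check(item)
        self._queue.put(item)

    def receive(self) -> Any:
        return self._queue.get()


class SerializingConnector(LocalConnector):
    """Stand-in for a remote connector: items cross a byte boundary.

    A real remote connector would transmit the bytes to another process;
    here they are only stored in the queue, so the receiver gets a copy.
    """

    def send(self, item: Any) -> None:
        self._check(item)
        self._queue.put(pickle.dumps(item))

    def receive(self) -> Any:
        return pickle.loads(self._queue.get())
```

Both classes expose the same send and receive calls, which is what keeps the transfer method transparent to the component using them. The observable difference is that the local variant hands back the very same object, while the serializing variant hands back an equal copy.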

SLIDE 13

Architecture

Client-server architecture:

  • clients retrieve documents with respect to the appropriate traversal strategy
  • the server partitions the web and assigns partitions to clients

Implementation using component framework classes

SLIDE 14

Marshaler component

  • forwards incoming URLs to clients based on domain or host name
  • caches recently forwarded URLs to decrease network load
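One way to cache recently forwarded URLs is a bounded least-recently-used set, sketched below. The slides do not say which eviction policy the marshaler uses, so LRU is an assumption, as are the RecentUrlCache name and its should_forward method.

```python
from collections import OrderedDict


class RecentUrlCache:
    """Bounded LRU set of URLs that were already forwarded."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def should_forward(self, url: str) -> bool:
        """Return True only if the URL is not among the recently forwarded ones."""
        if url in self._seen:
            self._seen.move_to_end(url)     # refresh recency, suppress the resend
            return False
        self._seen[url] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently used URL
        return True
```

A URL that falls out of the cache is forwarded again on its next appearance, so the receiving side still needs its own duplicate check.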

SLIDE 15

Marshaler component

Limited data exchange during web traversal:

  • locality principle: approx. 10% of hyperlinks are outbound from host or domain
  • batch transmission
  • Zipfian distribution: discarding cached URLs leads to sharply reduced load

Load balancing between marshalers: URL distribution based on URL host name hash
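Distribution by host name hash can be sketched as below. The function name and the choice of SHA-1 are assumptions; the slides only state that the host name is hashed. A stable digest is used because Python's built-in hash() for strings varies between processes.

```python
import hashlib
from urllib.parse import urlsplit


def assign_partition(url: str, partitions: int) -> int:
    """Map a URL to a partition index in [0, partitions) by its host name."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % partitions
```

All URLs of one host land in the same partition, so per-host state stays with a single marshaler.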

SLIDE 16

Basic client components

[Diagram: Server and Client 1 with Traversal component, Downloader, Parser, URL distributor and local url queue. Message labels: host, #new items; document url, referrer url; url, HTTP header, document content; base url, links; url, length, start/stop time, HTTP status code; external url; internal url; url belonging to Client 1; next url; finished]

SLIDE 17

Traversal component

[Diagram: client component data flow, repeated from slide 16]

SLIDE 18

Traversal component

  • fetches new URLs to download from persistent storage
  • notification on arrival of new URLs from server or availability of a host
  • selects next URL based on traversal strategy (breadth-first, relevance-based, etc.)

[Diagram: Url distributor, Traversal component, Downloader and url queue. Message labels: host, #new items; url, referrer url; finished]
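A breadth-first strategy amounts to serving URLs in discovery order, as in this minimal in-memory sketch. The BreadthFirstFrontier class and its methods are hypothetical; the component described above reads from persistent storage, which is left out here.

```python
from collections import deque
from typing import Deque, Optional, Set


class BreadthFirstFrontier:
    """FIFO frontier: URLs are handed out in the order they were discovered."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._known: Set[str] = set()

    def add(self, url: str) -> None:
        # Each URL is queued at most once.
        if url not in self._known:
            self._known.add(url)
            self._queue.append(url)

    def next_url(self) -> Optional[str]:
        """Return the oldest pending URL, or None if nothing is pending."""
        return self._queue.popleft() if self._queue else None
```

A relevance-based strategy could keep the same interface and replace the deque with a priority queue.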

SLIDE 19

Load balancing component

  • prevents overloading hosts
  • cooperates with traversal components
  • configurable delay between requests
  • dynamic adaptation based on response times
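A per-host delay that adapts to response times could be sketched like this. The rule used here (wait at least the configured delay, or a multiple of the last response time if that is longer) is one common politeness heuristic and an assumption; the slides do not specify the adaptation formula. HostThrottle and its parameters are illustrative names.

```python
from typing import Dict


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, base_delay: float, factor: float = 2.0) -> None:
        self._base_delay = base_delay   # configured minimum delay in seconds
        self._factor = factor           # multiplier applied to the response time
        self._next_allowed: Dict[str, float] = {}

    def is_available(self, host: str, now: float) -> bool:
        return now >= self._next_allowed.get(host, 0.0)

    def record(self, host: str, finished_at: float, response_time: float) -> None:
        # Slow hosts get a longer pause before the next request.
        delay = max(self._base_delay, self._factor * response_time)
        self._next_allowed[host] = finished_at + delay
```

Timestamps are passed in explicitly so the logic can be exercised without a real clock.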
SLIDE 20

Load balancing component

[Diagram: Url distributor, Traversal component, Downloader, Load balancer and url queue. Message labels: host, #new items; url, referrer url; statistics; available host]

SLIDE 21

Parser component

[Diagram: client component data flow, repeated from slide 16]

SLIDE 22

Parser component

Extracts hyperlinks from documents:

  • nonstandard HTML files
  • versatile hyperlink manifestations
  • performance issues (e.g. interpreting scripts)
  • document types and document encoding
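A tolerant link extractor can be sketched with Python's standard html.parser, which copes with unclosed tags, uppercase tag names and unquoted attributes. This is an illustration only: it handles anchor tags and nothing else (no scripts, frames or image maps), and the LinkExtractor and extract_links names are assumptions.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url: str, html: str) -> List[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Relative references are resolved against the document URL, which corresponds to the "base url, links" message the parser emits in the client diagram.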
SLIDE 23

URL distributor component

[Diagram: client component data flow, repeated from slide 16]

SLIDE 24

URL distributor component

  • configurable URL patterns: what to keep and what to omit

  • robot exclusion rules
  • crawler traps
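The three checks can be combined in one filter, sketched below. The UrlFilter class is hypothetical, the robots rules are parsed with the standard urllib.robotparser module from lines supplied by the caller (for a single host), and the path-depth limit is only one simple heuristic against crawler traps, chosen here as an assumption.

```python
import re
from typing import Iterable
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class UrlFilter:
    """Decides whether a discovered URL should be crawled."""

    def __init__(self, keep: Iterable[str], omit: Iterable[str],
                 robots_lines: Iterable[str], max_depth: int = 8) -> None:
        self._keep = [re.compile(p) for p in keep]
        self._omit = [re.compile(p) for p in omit]
        self._robots = RobotFileParser()
        self._robots.parse(list(robots_lines))  # rules of one host's robots.txt
        self._max_depth = max_depth

    def accepts(self, url: str, agent: str = "examplebot") -> bool:
        # 1. Configurable patterns: must match a keep rule and no omit rule.
        if not any(p.search(url) for p in self._keep):
            return False
        if any(p.search(url) for p in self._omit):
            return False
        # 2. Robot exclusion rules.
        if not self._robots.can_fetch(agent, url):
            return False
        # 3. Trap heuristic: reject very deep paths.
        depth = len([s for s in urlsplit(url).path.split("/") if s])
        return depth <= self._max_depth
```

A full crawler would keep one parsed rule set per host and typically add further trap checks, such as URL length or repeated path segments.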
SLIDE 25

Future work

  • configuration and monitoring from graphical user interface
  • dynamic, run-time re-configuration
  • dedicated data structures instead of databases for large sets of data
  • crawl on simulated data set and actual crawl
SLIDE 26

Summary

  • component framework for general tasks
  • loosely-coupled field-specific components
  • open, extensible, scalable architecture
  • transparent caching mechanisms
  • declarative configurability

Available for download at SourceForge.net

http://sourceforge.net/projects/webcrawler
SLIDE 27

Thank you for your attention!