
Crawler.NET: A component-based distributed framework for web traversal

Levente Hunyadi (BME AAIT), March 23, 2007


  1. Crawler.NET: A component-based distributed framework for web traversal. Levente Hunyadi (BME AAIT), March 23, 2007

  2. Motivation
     The Web:
     - a source of distributed information
     - a giant set of semi-structured data
     ⇒ search engines are invaluable to locate information

  3. Motivation
     - up-to-date index database
       ⇓
     - efficient traversal
       ⇓
     - parallelization
       ⇓
     - distributed architecture
       ⇓
     - increased complexity

  4. Objectives
     - scalability
     - easy configuration and management
     - support for extension
     - robustness, resilience to failures

  5. Architectural overview
     Two separate layers:
     - Component framework (general tasks): component interaction; lifecycle management; transparent interprocess communication
     - Crawling application (field-specific issues): downloading documents; extracting hyperlinks; administering page references; scheduling requests

  6. Design
     - the component framework exposes general component skeletons that realize common behavior
     - new, field-specific components are created by means of inheritance
     - the framework provides loose coupling between components
     Advantages:
     + simpler and faster development
     + openness for extension

  7. Building blocks of the architecture
     - Components: encapsulate field-specific functionality; produce, consume or transform data
     - Providers: give access to data sources
     - Connectors: provide asynchronous, message-based communication between components

  8. Components
     - abstract base class implements generic tasks
     - differentiated subclasses based on how they interact with environment

  9. Components (class hierarchy diagram)
     - base class: GenericComponent
     - second row: GenericProducer, GenericConsumer, SimpleFilter, ComplexFilter
     - further specialized filters: SynchronousOutputFilter, AsynchronousComplexFilter, SynchronousComplexFilter, SemiSynchronousComplexFilter
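
The idea of an abstract base class with differentiated subclasses can be illustrated with a minimal Python sketch. Only the class names GenericComponent, GenericProducer, GenericConsumer and SimpleFilter come from the slides; every method name (`start`, `stop`, `step`, `produce`, `consume`, `transform`), the use of plain `deque` objects in place of connectors, and the `LowercaseFilter` example are assumptions made for illustration, not the framework's actual API.

```python
from abc import ABC, abstractmethod
from collections import deque


class GenericComponent(ABC):
    """Abstract base class: generic tasks shared by every component."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    @abstractmethod
    def step(self):
        """Do one unit of work; subclasses decide which queues they touch."""


class GenericProducer(GenericComponent):
    """Interacts with its environment through an output only."""

    def __init__(self, name, output):
        super().__init__(name)
        self.output = output

    @abstractmethod
    def produce(self):
        """Return the next item, or None when there is nothing to emit."""

    def step(self):
        item = self.produce()
        if item is not None:
            self.output.append(item)


class GenericConsumer(GenericComponent):
    """Interacts with its environment through an input only."""

    def __init__(self, name, source):
        super().__init__(name)
        self.source = source

    @abstractmethod
    def consume(self, item):
        """Handle one incoming item."""

    def step(self):
        if self.source:
            self.consume(self.source.popleft())


class SimpleFilter(GenericComponent):
    """One input, one output: transforms each item it receives."""

    def __init__(self, name, source, output):
        super().__init__(name)
        self.source = source
        self.output = output

    @abstractmethod
    def transform(self, item):
        """Map one input item to one output item."""

    def step(self):
        if self.source:
            self.output.append(self.transform(self.source.popleft()))


class LowercaseFilter(SimpleFilter):
    """Hypothetical field-specific component created by inheritance."""

    def transform(self, item):
        return item.lower()
```

A field-specific component such as the hypothetical `LowercaseFilter` only supplies `transform`; the queue handling and lifecycle bookkeeping stay in the skeleton classes.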

  10. Providers
      - wrap external resources used by components
      - synchronized access to data sources
      - diverse functionality:
        - access databases
        - transparent cache mechanisms
        - network resources
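
As a rough illustration of a provider that combines synchronized access with a transparent cache, consider the following Python sketch. The class name `CachingProvider`, its `get` method and the `load` callback are hypothetical; the slides do not describe the real provider interface.

```python
import threading


class CachingProvider:
    """Wraps an external resource behind a lock and an in-memory cache."""

    def __init__(self, load):
        self._load = load              # callable that reads the external resource
        self._cache = {}               # key -> previously loaded value
        self._lock = threading.Lock()  # serializes access from several components

    def get(self, key):
        with self._lock:
            if key not in self._cache:
                # Cache miss: go to the underlying data source exactly once.
                self._cache[key] = self._load(key)
            return self._cache[key]
```

Components calling `get` cannot tell whether a value came from the cache or from the data source, which is what makes the cache transparent in this sketch.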

  11. Connectors
      - abstractions of typed queues
      - represent a message queue
      - intra-process or inter-process
      - support one-to-many, many-to-many relationships, identification by roles

  12. Realization of connectors
      Method of message transfer transparent to components:
      - local connector
        - typed FIFO queue
        - data is passed by reference
      - remote connector
        - corresponds to two local queues and associated network communication components in separate processes
        - data is serialized (and transmitted over TCP)
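
The difference between the two connector kinds can be sketched in Python as follows. This is a single-process illustration under stated assumptions: the class and method names (`LocalConnector`, `RemoteConnector`, `send`, `receive`, `pump`) are invented here, `pickle` stands in for whatever serialization the framework uses, and the `pump` method replaces the network communication components, so no TCP connection is opened.

```python
import pickle
from collections import deque


class LocalConnector:
    """Typed FIFO queue; items are handed over by reference."""

    def __init__(self, item_type):
        self._item_type = item_type
        self._queue = deque()

    def send(self, item):
        if not isinstance(item, self._item_type):
            raise TypeError(f"expected {self._item_type.__name__}")
        self._queue.append(item)

    def receive(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


class RemoteConnector:
    """Two local queues with a serialization step between them."""

    def __init__(self, item_type):
        self._outbound = LocalConnector(item_type)  # sender-side queue
        self._inbound = LocalConnector(item_type)   # receiver-side queue

    def send(self, item):
        self._outbound.send(item)

    def pump(self):
        """Stand-in for the network components: move all queued items across."""
        while len(self._outbound):
            wire_bytes = pickle.dumps(self._outbound.receive())
            self._inbound.send(pickle.loads(wire_bytes))

    def receive(self):
        return self._inbound.receive()
```

Both classes expose the same `send` and `receive` calls, so a component written against one works with the other; the observable difference is that the local connector returns the very same object, while the remote one returns an equal copy.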

  13. Architecture
      Client-server architecture:
      - clients retrieve documents with respect to the appropriate traversal strategy
      - the server partitions the web and assigns partitions to clients
      Implementation using component framework classes

  14. Marshaler component
      - forwards incoming URLs to clients based on domain or host name
      - caches recently forwarded URLs to decrease network load
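
A minimal Python sketch of these two marshaler duties follows. It assumes a fixed host-to-client table and a least-recently-used cache; the slides only say that recently forwarded URLs are cached, so the eviction policy, the `Marshaler` class shape and the `forward` method are all assumptions.

```python
from collections import OrderedDict
from urllib.parse import urlsplit


class Marshaler:
    """Routes URLs to clients by host name and suppresses recent repeats."""

    def __init__(self, host_to_client, cache_size=1000):
        self._host_to_client = host_to_client  # assumed static partition table
        self._recent = OrderedDict()           # recently forwarded URLs, oldest first
        self._cache_size = cache_size

    def forward(self, url):
        """Return the target client, or None if the URL was forwarded recently."""
        if url in self._recent:
            self._recent.move_to_end(url)      # refresh its position in the cache
            return None
        self._recent[url] = True
        if len(self._recent) > self._cache_size:
            self._recent.popitem(last=False)   # evict the least recently seen URL
        host = urlsplit(url).hostname
        return self._host_to_client[host]
```

Returning `None` for a cached URL means nothing is sent over the network for it, which is where the reduction in network load comes from in this sketch.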

  15. Marshaler component
      Limited data exchange during web traversal:
      - locality principle: approx. 10% of hyperlinks are outbound from host or domain
      - batch transmission
      - Zipfian distribution: discarding cached URLs leads to sharply reduced load
      Load balancing between marshalers: URL distribution based on URL host name hash
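
Distributing URLs by a hash of the host name, combined with batch transmission, might look like the following Python sketch. The choice of SHA-1, the function names `marshaler_index` and `batch_by_marshaler`, and the modulo mapping are assumptions; the slides state only that distribution is based on a host name hash.

```python
import hashlib
from urllib.parse import urlsplit


def marshaler_index(url, marshaler_count):
    """Pick a marshaler from a stable hash of the URL's host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % marshaler_count


def batch_by_marshaler(urls, marshaler_count):
    """Group URLs so that each marshaler receives one batch."""
    batches = {}
    for url in urls:
        index = marshaler_index(url, marshaler_count)
        batches.setdefault(index, []).append(url)
    return batches
```

A cryptographic digest is used here rather than Python's built-in `hash`, because the latter is randomized per process for strings and would send the same host to different marshalers on different machines.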

  16. Basic client components (diagram)
      - elements shown: Server, Client 1, URL distributor, Traversal component, Downloader, Parser, local url queue
      - message labels: "url belonging to Client 1"; "external url"; "internal url"; "next url"; "host, #new items"; "url, referrer url"; "url, length, start/stop time, document base url, HTTP status code"; "url, HTTP header, document content"; "links"; "finished"

  17. Traversal component (diagram: the same client component overview as on slide 16)

  18. Traversal component
      - fetches new URLs to download from persistent storage
      - notification on arrival of new URLs from server or availability of a host
      - selects next URL based on traversal strategy (breadth-first, relevance-based, etc.)
      Diagram elements: URL distributor, Traversal component, url queue, Downloader; message labels: "host, #new items"; "url, referrer url"; "finished"
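
Selecting the next URL through a pluggable traversal strategy can be sketched in Python as below. The strategy classes, the `add`/`next_url` interface and the numeric `score` argument are hypothetical, and the frontier is kept in memory here, whereas the slides say the component fetches URLs from persistent storage.

```python
import heapq
from collections import deque
from itertools import count


class BreadthFirstStrategy:
    """Hands out URLs in the order they were discovered."""

    def __init__(self):
        self._queue = deque()

    def add(self, url, score=0.0):
        self._queue.append(url)  # the score is ignored by this strategy

    def next_url(self):
        return self._queue.popleft() if self._queue else None


class RelevanceStrategy:
    """Hands out the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps insertion order for equal scores

    def add(self, url, score=0.0):
        # heapq is a min-heap, so the score is negated to pop the maximum first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


class TraversalComponent:
    """Delegates the choice of the next URL to the configured strategy."""

    def __init__(self, strategy):
        self._strategy = strategy

    def on_new_url(self, url, score=0.0):
        self._strategy.add(url, score)

    def next_url(self):
        return self._strategy.next_url()
```

Swapping `BreadthFirstStrategy` for `RelevanceStrategy` changes the crawl order without touching the component that feeds the downloader.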

  19. Load balancing component
      - prevents overloading hosts
      - cooperates with traversal components
      - configurable delay between requests
      - dynamic adaptation based on response times
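
One possible reading of "configurable delay" plus "dynamic adaptation based on response times" is sketched below in Python. The rule used here (wait at least `base_delay` seconds, and longer when the host responded slowly) and all names (`LoadBalancer`, `is_available`, `record_download`, `factor`) are assumptions; the slides do not give the actual adaptation formula.

```python
class LoadBalancer:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, base_delay=1.0, factor=2.0):
        self.base_delay = base_delay  # configurable minimum delay in seconds
        self.factor = factor          # how strongly slow responses stretch the delay
        self._next_allowed = {}       # host -> timestamp of next permitted request

    def is_available(self, host, now):
        """True if the host may be contacted at time `now`."""
        return now >= self._next_allowed.get(host, 0.0)

    def record_download(self, host, finished_at, response_time):
        """Update the host's waiting period after a download completes."""
        delay = max(self.base_delay, self.factor * response_time)
        self._next_allowed[host] = finished_at + delay
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic; a traversal component would query `is_available` before handing a URL of that host to the downloader.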

  20. Load balancing component (diagram)
      - elements shown: Load balancer, URL distributor, Traversal component, url queue, Downloader
      - message labels: "available host"; "host, #new items"; "url, referrer url"; "statistics"

  21. Parser component (diagram: the same client component overview as on slide 16)
