Archiving the Web sites of Athens University of Economics and Business
Michalis Vazirgiannis
Web Archiving
Introduction
Context
Motivation
Archive’s mission: Chronicle the history of the Internet
Data is in danger of disappearing
Rapid evolution of the web and changes in the web content
Hardware systems do not last forever
Malicious attacks
The rise of application-based market is a potential “web
Critical to capture the web content while it still exists in such
Average hosting provider switching per month: 8.18%
Web sites hacked per day: 30,000
34% of companies fail to test their tape backups, and of
Every week 140,000 hard drives
Loss of valuable Information from websites
Absence of major web-archiving activities within
Archiving the Web sites of Athens University of Economics and Business
The Wayback Machine
Archive-it
International Internet Preservation Consortium
Open Library
Many more individual archiving
Stor
Compressed set of uncorrelated web pages
Impo
Index
and Retrieval Algorithms
Ac
functionality and presents it to the end-user
Web Page
Web Objects: Text, links, multimedia files
Web Crawl
Selection Policy
Re-visit Policy
Politeness Policy
Parallelization policy
Crawler frontier
Page downloader
Web repository
Breadth First Search Algorithm
queue
page and add to the queue
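The breadth-first strategy can be sketched as follows. This is a minimal illustration, not the crawler used in the project: `TOY_LINKS` is a hypothetical in-memory link graph that stands in for real page downloads, and `bfs_crawl` is a name chosen here for the example.

```python
from collections import deque

# Hypothetical toy link graph standing in for real page downloads.
TOY_LINKS = {
    "aueb.gr/": ["aueb.gr/a", "aueb.gr/b"],
    "aueb.gr/a": ["aueb.gr/c"],
    "aueb.gr/b": ["aueb.gr/c", "aueb.gr/d"],
    "aueb.gr/c": [],
    "aueb.gr/d": ["aueb.gr/"],
}

def bfs_crawl(seed, links):
    """Visit pages level by level, starting from the seed URL."""
    frontier = deque([seed])  # FIFO queue acting as the crawler frontier
    seen = {seed}             # URLs already queued, to avoid revisits
    order = []
    while frontier:
        url = frontier.popleft()          # take the oldest queued URL
        order.append(url)
        for out in links.get(url, []):    # extract the page's links
            if out not in seen:
                seen.add(out)
                frontier.append(out)      # add new links to the queue
    return order
```

Pages closest to the seed are fetched first, which is why this order is often preferred when the seeds are the most important pages.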
Depth First Search Algorithm
repeat steps
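A depth-first variant differs only in the frontier: a stack replaces the queue. The sketch below makes the same assumptions as any toy example (a hypothetical in-memory link graph `TOY_LINKS`, an illustrative function name `dfs_crawl`).

```python
# Hypothetical toy link graph standing in for real page downloads.
TOY_LINKS = {
    "aueb.gr/": ["aueb.gr/a", "aueb.gr/b"],
    "aueb.gr/a": ["aueb.gr/c"],
    "aueb.gr/b": ["aueb.gr/c", "aueb.gr/d"],
    "aueb.gr/c": [],
    "aueb.gr/d": ["aueb.gr/"],
}

def dfs_crawl(seed, links):
    """Follow each chain of links as deep as possible before backtracking."""
    stack = [seed]  # LIFO stack acting as the crawler frontier
    seen = set()
    order = []
    while stack:
        url = stack.pop()                 # take the most recently added URL
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        # Push in reverse so the first link on the page is explored first.
        for out in reversed(links.get(url, [])):
            if out not in seen:
                stack.append(out)
    return order
```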
Page Rank Algorithm
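As a rough illustration of the idea, PageRank can be computed by power iteration over the link graph. The damping factor of 0.85 and the iteration count are conventional choices assumed here, not values taken from the slides, and the slides do not say how the scores feed into crawl ordering.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps each page to its out-links;
    every out-link must itself be a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for out in outs:
                    new_rank[out] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

A crawler could, for example, fetch higher-ranked URLs first.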
Genetic Algorithm
Naïve Bayes classification Algorithm
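A Naïve Bayes classifier can be used to decide whether a page is worth crawling from the words it contains. The sketch below is a generic multinomial Naïve Bayes with add-one smoothing; the labels, the training data, and the function names are hypothetical and not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns the counts needed to classify."""
    class_counts = Counter()            # documents per label
    word_counts = defaultdict(Counter)  # token counts per label
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```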
Text Tokenization
Language-specific stemming
Definition of Stop words
Distributed Index
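The indexing steps above can be sketched as a small pipeline: tokenize, drop stop words, stem, then build an inverted index. Everything here is an assumption made for illustration: the stop-word list is a tiny sample, `crude_stem` is a placeholder rather than a real language-specific stemmer, and the index lives in a single process rather than being distributed.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}  # illustrative sample only

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def crude_stem(token):
    """Placeholder suffix stripper; NOT a real language-specific stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, remove stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index
```

The same `analyze` function must be applied to queries, so that query terms match the stemmed terms stored in the index.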
Boolean model (BM)
Vector Space Model (VSM)
Heritrix
content in the System
Wayback Machine
Apache Solr
Open Source Software
Heritrix Crawler
Data storage in the WARC format (ISO 28500:2009)
Collects data based on 75 seed URLs
Re-visiting Policy: Checks for updates once per month
Politeness Policy: Collects data with respect to the Web Server.
Every 10 seconds from the same Web Server
With a time delay of ten times the duration of the last crawl
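The politeness rule can be sketched as a per-host delay. The slides give two figures (10 seconds, and ten times the duration of the last crawl) without saying how they combine; taking the larger of the two is an assumption made here. `politeness_delay` and `HostScheduler` are hypothetical names, not part of Heritrix.

```python
def politeness_delay(last_fetch_seconds, factor=10.0, min_delay=10.0):
    """Seconds to wait before the next request to the same server.
    Assumption: use the larger of the fixed 10 s gap and 10x the last fetch time."""
    return max(min_delay, factor * last_fetch_seconds)

class HostScheduler:
    """Tracks, per host, the earliest time the next request is allowed."""

    def __init__(self):
        self.next_allowed = {}  # host -> earliest timestamp of the next request

    def record_fetch(self, host, finished_at, duration):
        """Register a finished fetch and compute when the host may be hit again."""
        self.next_allowed[host] = finished_at + politeness_delay(duration)

    def can_fetch(self, host, now):
        """True if enough time has passed since the last fetch from this host."""
        return now >= self.next_allowed.get(host, 0.0)
```

A slow response therefore lengthens the wait, so a struggling server is contacted less often.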
Part of a WARC file from the crawl of the aueb.gr domain
Captures all HTML text and elements
Content under aueb.gr can be fully reconstructed
Creates the index based only on the URL and the day that the
Queries must have a time frame parameter
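An index keyed by URL and capture day, queried with a time frame, might look like the sketch below. It only illustrates the idea and is not the Wayback Machine's actual index format; `CaptureIndex` and its methods are hypothetical names.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import date

class CaptureIndex:
    """Maps each URL to the sorted list of days on which it was captured."""

    def __init__(self):
        self._captures = defaultdict(list)  # url -> sorted capture dates

    def add(self, url, day):
        """Record a capture of `url` on `day`, keeping the dates sorted."""
        dates = self._captures[url]
        dates.insert(bisect_left(dates, day), day)

    def query(self, url, start, end):
        """Return the capture dates of `url` within [start, end], inclusive."""
        dates = self._captures.get(url, [])
        return dates[bisect_left(dates, start):bisect_right(dates, end)]
```

Because the index holds nothing but URL and date, a query cannot search page content; that is the job of the full-text index.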
Full-text search of the archived documents based on the
Uses a combination of the Boolean model and the vector space model
Documents "approved" by BM are scored by VSM
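The two-stage idea (Boolean filter first, vector-space ranking second) can be sketched with textbook TF-IDF and cosine similarity. This is an assumption-laden illustration: Solr's real scoring formula is more elaborate, the Boolean query is taken to be a plain AND of all terms, and the function names are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: {doc_id: [tokens]}. Returns ({doc_id: {term: weight}}, idf)."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency of each term
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = {
        d: {t: c * idf[t] for t, c in Counter(tokens).items()}
        for d, tokens in docs.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def search(docs, query_terms):
    """Boolean AND filter, then rank the approved documents by cosine score."""
    vectors, idf = tf_idf_vectors(docs)
    # Boolean model: a document is approved only if it contains every query term.
    approved = [d for d, tokens in docs.items() if all(q in tokens for q in query_terms)]
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(query_terms).items()}
    # Vector space model: score only the approved documents.
    scored = [(cosine(q_vec, vectors[d]), d) for d in approved]
    return [d for _, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Filtering first keeps the scoring step cheap, since only documents that already satisfy the query are ranked.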
~500,000 URLs visited
~500 hosts visited
The steady numbers
Archive of ~500,000 URIs with monthly frequency
Data from the Network:
Storage of the novel URLs only:
Storage in compressed format:
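The two storage savings listed above (keeping only novel content, and compressing what is kept) can be sketched together. The slides do not say how novelty is detected; comparing a content digest with the previous capture of the same URL is an assumption made here, and `DedupStore` is a hypothetical name.

```python
import gzip
import hashlib

class DedupStore:
    """Stores a capture only if its content differs from the last one for that URL."""

    def __init__(self):
        self.last_digest = {}  # url -> digest of the last stored version
        self.records = []      # list of (url, gzip-compressed body)

    def offer(self, url, body):
        """Store `body` (bytes) if it is new for `url`. Returns True if stored."""
        digest = hashlib.sha1(body).hexdigest()
        if self.last_digest.get(url) == digest:
            return False  # unchanged since the last crawl: skip it
        self.last_digest[url] = digest
        self.records.append((url, gzip.compress(body)))
        return True
```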
Unify all functionality (Wayback and Solr)
Experiment with different metrics and
Collect data through web forms (Deep Web)