WEB ARCHIVING @ LIP6
STÉPHANE GANÇARSKI
- Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA, N. THOME, M. LAW
French ANR project Cartec (ended), European project Scape (ends 9/2014)
WEB ARCHIVES
The Web is ephemeral and constantly evolving: we need to preserve its information.
Web: 50,000,000,000 pages (Google@2012), non-cumulative index, main concern: freshness
Web archives: 165,000,000,000 pages (Internet Archive@2012), cumulative index, main concerns: coherence, completeness, preservation
Web archiving tasks:
- Crawling: temporal completeness and coherence; spatial completeness (discovery); avoid duplicates
- Access: queries, IR, indexing, navigation, coherence
- Preservation: emulation, migration, cloud computing
Efficient crawling: maximize temporal completeness and coherence under limited resources (bandwidth, politeness, storage, ...).
- Temporal completeness: how well the archive captures the history of a page. Relevant for medium-size archives (e.g. the INA legal deposit); for large archives, spatial completeness matters more.
- Temporal coherence: capture versions of different pages that were present at the same time on the site.
Change importance and page importance together weight each version. Completeness of the archive:

Completeness = Σ importance(captured versions) / Σ importance(all versions that appeared on the Web)
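A minimal sketch of this measure in Python, assuming the importance scores have already been estimated (the names and the dict-based interface are illustrative, not the project's code):

def weighted_completeness(captured, appeared):
    """captured/appeared: version id -> importance score
    (change importance x page importance)."""
    total = sum(appeared.values())
    if total == 0:
        return 1.0  # no version ever appeared: trivially complete
    return sum(captured.values()) / total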
What are the changes between version n-1 and version n (e.g. inserts, updates)? Are they important? It depends on where the change occurs.
Change importance is related to what users see:
- Render pages before analysis
- Users see blocks of information
- Use web page segmentation
[Architecture: a crawler fetches version V(n) of a page from the Web and stores it in the archive; change detection compares V(n) with V(n-1); importance estimation and pattern discovery maintain per-page time series of change patterns]
[Figure: VIPS segmentation of a page into a block tree: Page → B1, B2 (B2.1, B2.2, B2.3), B3]
VIPS extension: extract the links, images and texts of each block.
Vi-XML document:

<?xml version="1.0"?>
<Page url="" version="">
  <Block ref="B1" pos="">
    <Links id="">
      <link name="" adr=""/>
      <link name="" adr=""/>
    </Links>
    <Images id="">
      <img name="" src=""/>
    </Images>
    <Texts id="" text=""/>
  </Block>
  <Block ref="B2" id="">
    ...
  </Block>
</Page>
[Figure: the corresponding Vi-XML tree: blocks B1, B2.1, B2.2, B2.3, B3, each carrying IDs/IDLists; leaves hold links (name, adr), images (name, src) and texts]
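Since Vi-XML is plain XML, it can be inspected with a standard parser; a small sketch assuming the element names above (not the project's actual tooling):

import xml.etree.ElementTree as ET

def links_per_block(path):
    """Map each block's ref to the link addresses it contains."""
    root = ET.parse(path).getroot()
    return {block.get("ref"): [l.get("adr") for l in block.iter("link")]
            for block in root.iter("Block")}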
Segmentation is based on VIPS [Cai03].
Vi-DIFF: change detection between the Vi-XML documents of version n-1 and version n (block trees with their IDs). Exploiting the block structure reduces the complexity from O(n²) (generic tree diff) to O(n·log n) (sketched below).
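A toy sketch of the underlying idea: match blocks by a content signature instead of comparing all pairs (the actual Vi-DIFF algorithm is more involved):

def diff_blocks(old, new):
    """old/new: block id -> content string. Returns edit operations."""
    old_by_content = {c: b for b, c in old.items()}  # signature index
    new_contents = set(new.values())
    ops = []
    for b, c in new.items():
        if b in old:
            if old[b] != c:
                ops.append(("update", b))            # same block, new content
        elif c in old_by_content:
            ops.append(("move", old_by_content[c], b))  # content relocated
        else:
            ops.append(("insert", b))
    for b, c in old.items():
        if b not in new and c not in new_contents:
            ops.append(("delete", b))
    return ops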
From successive delta files we can compute change patterns for pages.
Blocks are weighted according to [Song&al@WWW04]; the change importance between two page versions is the (normalized) weighted sum of block changes.
From a page's patterns and its last crawl date, we compute an urgency function that estimates the change importance accumulated on the page since the last crawl.
We crawl the page that has maximum urgency (a sketch follows).
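An illustrative sketch of the scheduler, not the paper's implementation; a page's pattern is assumed to give the expected change importance for each hour of the day, and urgency accumulates it since the last crawl:

def next_page_to_crawl(pages, now_hour):
    """pages: dicts with 'pattern' (24 expected importances, one per
    hour of day) and 'last_crawl_hour' (integer timestamp)."""
    def urgency(page):
        return sum(page["pattern"][h % 24]
                   for h in range(page["last_crawl_hour"], now_hour))
    return max(pages, key=urgency)

After a page is crawled, its last_crawl_hour is reset, so its urgency drops to zero and other pages move up the ranking.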
Evaluation: page versions are crawled from a "complete" archive, so that we can compute the completeness achieved by each strategy.
What if we click on this link?
Related work:
- Recent: returns the closest version before tq: P2[t1]
- Nearest: returns the version minimizing |tq − tx|: P2[t2]
Our approach:
- Choose the version of P2 that has the maximum probability of being coherent with P1[tq], according to P1's and P2's change patterns (sketched after the timeline figure below)
[Timeline: crawl times of P1 and P2; candidate versions P2[t1] and P2[t2] around the query time tq of P1[tq]]
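A hedged sketch of the selection rule; the coherence probability would be derived from P1's and P2's change patterns, and is passed in here as a function:

def pick_coherent_version(p1_version, p2_versions, coherence_prob):
    """Return the archived version of P2 most likely to have
    coexisted with p1_version on the live site."""
    return max(p2_versions, key=lambda v2: coherence_prob(p1_version, v2))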
Evaluation: count how often the coherent version is chosen.
Dataset: 60 France TV channels, 1000 hourly crawls.
Simulation: links between pages, crawling.
Results: 15% better than Nearest, 40% better than Recent.
Wayback Machine, Internet Archive
Full-text search
Historians, journalists, researchers, Web philologists, Web archaeologists
Full-text search and navigation are usually sufficient for casual users, but not for these advanced users.
Classic operators:
- Search restricted to blocks
- Get the version of page p at time t: incomplete if p was not captured at t
Navigational operators:
- Navigate between archived pages following links (e.g. from page A to pages B and C, then D and E)
(A sketch of both operator kinds follows.)
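A sketch over a toy archive mapping each page URL to its list of (timestamp, version) captures; the interface is hypothetical:

def get_version(archive, page, t):
    """Classic operator: latest capture of `page` not after t, or
    None if the page was not captured by then (incompleteness)."""
    candidates = [(ts, v) for ts, v in archive.get(page, []) if ts <= t]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None

def navigate(archive, link_url, t):
    """Navigational operator: follow a link found in a version
    displayed at time t, staying inside the archive."""
    return get_version(archive, link_url, t)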
Index pruning: discard the (term, doc) postings that matter least, so that (part of) the index fits in main memory; pruning is done off-line (a generic sketch follows the list of strategies).
Existing strategies: Random, TCP (Carmel et al. SIGIR '01), IP-u (Chen et al. CIKM '12), 2N2P (Thota et al. ECIR '11), PRPP (Blanco et al. ACM '10).
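A minimal sketch of off-line posting pruning (keep the top-k postings per term by score); this is illustrative, not one of the cited methods:

def prune_index(index, k):
    """index: term -> list of (doc_id, score). Keep the k
    highest-scoring postings per term."""
    return {term: sorted(postings, key=lambda p: p[1], reverse=True)[:k]
            for term, postings in index.items()}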
Existing pruning techniques were not designed with time in mind; the temporal dimension should be preserved while pruning. An ANOVA test shows a link between temporal coverage and retrieval performance.
We design 3 pruning methods that take temporal diversification into account. Temporal aspects are modeled with windows: fixed size (simple and sliding) and dynamic (Gaussian mixture model); the fixed-window variant is sketched below.
Example query: "Iraq War".
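A sketch of the fixed-window variant, under the assumed semantics that the best postings inside every time window are kept so each period of the archive stays represented after pruning:

def prune_time_aware(index, k, window):
    """index: term -> list of (doc_id, timestamp, score).
    Keep the top-k postings per term per fixed time window."""
    pruned = {}
    for term, postings in index.items():
        buckets = {}
        for doc_id, ts, score in postings:
            buckets.setdefault(ts // window, []).append((doc_id, ts, score))
        pruned[term] = [p for bucket in buckets.values()
                        for p in sorted(bucket, key=lambda x: x[2],
                                        reverse=True)[:k]]
    return pruned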
Dataset: the English Wikipedia of July 7, 2009, with 2,300,277 articles carrying temporal information, and 40 queries with a temporal dimension and associated relevance judgments.
Our methods outperform the existing ones in most settings. We are currently trying to experiment on larger archives.
Detect blocks of information in a page. Many applications in data preservation and access:
- Crawl scheduling (cf. first part of the talk)
- Emulation control: check whether an archived page renders properly with a new browser (if not, keep the old browser)
- Migration control: archived pages must be migrated (e.g. when the archive file format changes); check that rendering is the same before/after migration
- Mobile devices (small screens): display blocks, not the whole page
- HTML4 to HTML5 migration (map blocks to HTML5 tags, current work)
- Etc.
[Pipeline: page W → rendering (DOM) → analysis (content structure) → understanding (logical structure) → reconstruction → segmented page W']
Content categories: root, tabular, forms, links, ...
Labels: header, navigation, article, content, ...
Evaluation against a ground-truth segmentation built with the MoB tool: a computed segmentation is scored by the number of elements it places in common with the ground truth (evaluation scheme from IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):941–954, 2008). A sketch of such a score follows.
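A hedged approximation of a block-correspondence score (the actual metric follows the cited evaluation scheme): the share of ground-truth elements recovered by the best-matching computed block.

def segmentation_score(computed, truth):
    """computed/truth: block id -> set of DOM element ids."""
    total = sum(len(els) for els in truth.values())
    if total == 0:
        return 1.0
    shared = sum(max((len(els & c) for c in computed.values()), default=0)
                 for els in truth.values())
    return shared / total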
Block-based information extraction:
- Detecting blocks is itself an extraction task
- Information extraction = adding semantics to blocks
- Automatic classification: specific rules based on geometry, text proportion, ... (illustrated below)
- Extract the information contained in blocks; leverage the segmentation to optimize the extraction process (e.g. objects extracted from the same block, or from adjacent blocks, are related/linked)
ML/image processing techniques (cf. N. Thome's presentation):
- Enhance the comparison algorithm (a similar/different classifier)
- Learn block weights (go beyond Song's approach)
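An illustrative rule-based block classifier in the spirit described above; the features, thresholds and order of the rules are invented for the sketch:

def classify_block(block):
    """block: {'top': y position in px, 'height': px,
    'link_ratio': links per word, 'text_len': characters}."""
    if block["top"] == 0 and block["height"] < 120:
        return "header"      # thin band at the top of the page
    if block["link_ratio"] > 0.5:
        return "navigation"  # mostly links
    if block["text_len"] > 500:
        return "article"     # long running text
    return "content"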