Introducing Web Fragments: An exploration of web archives beyond the webpages

Introducing Web Fragments An exploration of web archives beyond the webpages Quentin Lobbé (LTCI, Télécom ParisTech & Inria Paris) DBWeb seminar – May 31, 2017

The e-Diasporas Atlas
> A collection of online migrant collectives
A migrant website is a website created or managed by migrants and/or that deals with them.
An e-Diaspora is a directed network of migrant websites linked by URLs.
10,000 migrant websites crawled, categorized and organized into 30 e-Diasporas.
[Figure labels: marocainsdumonde.gov.ma | larbi.org | yabiladi.com | moroccoboard.com]

The e-Diasporas Atlas
> A tool for sociological analysis
[Figure: the Moroccan e-Diaspora, with yabiladi.com labeled]

The e-Diasporas Atlas
> A tool for sociological analysis
[Figure: the Moroccan e-Diaspora, with yabiladi.com labeled; categories shown: Associations, Blogs, Institutions]

Facing the evolution of e-Diasporas ...
> new website
> alternative spaces of expression
> death of blogs
> new link
[Figure: the Moroccan e-Diaspora, with yabiladi.com labeled]

… we build a corpus of web archives
> To keep a trace of the evolution of websites
[Figure: pages 1 and 2 at time 1, then pages 1, 2 and 3 at time 2, archived as record 1 and record 2]
> Our corpus is a 70 TB web archive, organized by e-Diaspora, crawled weekly or monthly between 2010 and 2015 and hosted at the INA

Our original research questions
> Considering the archived e-Diasporas corpus: are the structure and content of the archived e-Diasporas permeable to the effects of shocks and external events such as political and social mobilizations?
> Considering any archived corpus: how can we follow traces through web archives to study a given event and its genesis, placing it back in the dual temporality of the web and the real world?

The naive approach
> Focusing on the particular case of yabiladi.com: a hub at the center of the network, and an old, hybrid website (forum, videos, news, dating) online since 2002
> 2.8 million archived pages

The naive approach
> Considering all the archived pages as traces of activity on the website
[Figure: number of new archived pages per day]
> Are those peaks and valleys relevant?

Web archives are not direct traces of the web
> Web archives should be considered as direct traces of the crawler
[Figure: the continuous Web versus discrete archives, with downloads at date 1, date 2 and date 3]
> We observed what we call a crawl legacy effect

To avoid the crawl legacy effect, we propose to conduct an exploratory analysis of web archives that goes beyond the level of the webpage

The original scale of web archives is the webpage
> What can we learn from the structure of web archive files?
[Figure: .WARC and .DAFF files at times t1 and t2, with fields meta, crawler date, download date, data and html content]
> By definition, web archives are built on top of webpages

Archiving is all about selecting and destroying
> As webpages change over time:
> structural changes: move, copy, delete, insert, update …
> attribute changes: CSS, font …
> type changes: <div> to <p>
> semantic changes
[Image: "Boulevard du Temple", Louis Daguerre, 1838]

Archiving on top of webpages comes with many challenges
> Crawler blindness and archive quality
[Figure: edition dates, crawler dates, download dates and archived periods]
> Web archiving comes with construction locks

Archiving on top of webpages comes with many challenges
> Archive consistency across pages
[Figure: p1 changes, p2 changes and the p1 & p2 archives, with a question mark on the href between p1 and p2]
> Web archiving comes with navigation locks

Archiving on top of webpages comes with many challenges
> Pages with archive-like content
[Figure: p1 changes versus p1 archives]
> Archiving comes with discrete and continuous interpretation locks

To face or reduce these challenges, we propose to build a new entity based on web archives, called web fragments
[Figure labels: meta, data, ?, web page]

The web fragment
> A structured part of a webpage with highly informative content, e.g. an article, a news item, a comment
> New structure for web archives: meta (crawler date); data page (download date, page content); data frag (edition date, author, frag content), repeated for each fragment
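As a rough illustration of this record layout, here is a minimal Python sketch. The class names, field names and the sample dates are assumptions made for the example; the slide only names the fields (crawler date, download date, page content, edition date, author, frag content).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Fragment:
    """'data frag': one structured part of a page (article, news item, comment)."""
    content: str
    author: Optional[str] = None
    edition_date: Optional[datetime] = None


@dataclass
class ArchivedPage:
    """One archive record: 'meta', 'data page' and the fragments found in the page."""
    crawler_date: datetime    # meta
    download_date: datetime   # data page
    page_content: str         # data page
    fragments: List[Fragment] = field(default_factory=list)


# Hypothetical record: the dates below are made up for the example.
page = ArchivedPage(
    crawler_date=datetime(2014, 5, 1),
    download_date=datetime(2014, 5, 2),
    page_content="<html>...</html>",
    fragments=[Fragment("blabla", author="Kim", edition_date=datetime(2003, 6, 1))],
)

# A fragment's edition date can be much older than the page's download date.
oldest = min(f.edition_date for f in page.fragments if f.edition_date)
```

Keeping the edition date on the fragment rather than on the page is what lets an analysis reason about when content was written instead of when the crawler happened to fetch it.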

Finding web fragments
> We must see a webpage as a front & back end object
[Figure: the same comment shown as back end (HTML below) and as front end (screen)]
<div id="com-" class="news news_plus" style="padding-top:0px;">
  <div class="effet_special">
    <div class="com-header">
      <div class="comment-subject">
        <div class="icone-comment iconuser_m" style=""> Kim </div>
      </div>
      <div class="com-info">
        <a style="font-weight:400;" href="">Kim</a> 08 mai 2017 à 12h28 <br>
        <span class="com-auteur">08 mai 2017 à 12h28</span>
      </div>
    </div>
    Blabla
    <span id="nombre-" style="display:none;"></span>
    <div class="buzz"> </div>
  </div>
  <div class="com-content" id="content-comment8537568"> Blabla </div>
  <div style="float:left;width:100%;"> </div>
</div>
> Related works see a webpage as a flat file, an ordered tree or an unordered tree

Finding web fragments
> A webpage is a 2D hierarchical list of HTML nodes (axes: depth and sequence)
<div id="com-8537568" class="new comment">
  <div class="effet_special">
    <div class="com-header">
      <div class="com-info">
        <a class="com-author" href="/profil/24368/kim.html">Kim</a> <br>
        <span class="com-date">le 08 mai 2017 à 12h28</span>
      </div>
    </div>
    <span id="nombre-" style="display:none;"></span>
    <div class="buzz"> </div>
  </div>
  <div class="com-content" id="content-comment8537568"> blabla </div>
  <div style="float:left;width:100%;"> </div>
</div>
> Nodes are categorized as title, author, date or text
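One way to picture the "2D list" view is to flatten the markup into (sequence, depth) coordinates. The sketch below uses Python's standard `html.parser`; the `NodeLister` class, the `list_nodes` helper and the simplified sample markup are illustrative assumptions, not the tooling used in the talk.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NodeLister(HTMLParser):
    """Collects every opening tag as (sequence, depth, tag, attributes)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # sequence = position in document order, depth = nesting level
        self.nodes.append((len(self.nodes), self.depth, tag, dict(attrs)))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def list_nodes(html):
    parser = NodeLister()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Simplified version of a comment block (hypothetical sample).
SAMPLE = (
    '<div id="com-8537568" class="new comment">'
    '<div class="com-header"><div class="com-info">'
    '<a class="com-author">Kim</a><br>'
    '<span class="com-date">le 08 mai 2017 à 12h28</span>'
    '</div></div>'
    '<div class="com-content" id="content-comment8537568">blabla</div>'
    '</div>'
)

nodes = list_nodes(SAMPLE)
for sequence, depth, tag, attrs in nodes:
    print(sequence, depth, tag, attrs.get("class", ""))
```

The sketch assumes well-formed markup; real archived pages often are not, so a production version would need a more forgiving parser.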

Finding web fragments
> Nodes are selected based on markup, class and id, using regex
<h1 id="title" class="title_comment"> Hello archives </h1>
> Nodes are incrementally grouped into web fragments using ad-hoc rules
[ U text ] or [ text U _text ] or [ title U text ] or [ date U _text ] or [ author U date ] ...
> Algorithm
1. Select nodes in DOM
2. Group in fragments
3. Group by list of fragments
[Figure: worked example]
Selected nodes: text, text, title, author, text, author, date, text, author, date, text
Fragments: [ text text ], [ title author text ], [ author date text ], [ author date text ]
Lists of fragments: [ text text ]; [ title author text ]; [ author date text ], [ author date text ]
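The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the regex patterns are hypothetical, and the single grouping rule used here (close the current fragment when a non-text node arrives after a text node) stands in for the ad-hoc rules of the talk, which are not fully specified on the slide.

```python
import re
from itertools import groupby

# Hypothetical patterns, tested in order against "tag class id".
PATTERNS = [
    ("title", re.compile(r"^h[1-6]\b|title", re.I)),
    ("author", re.compile(r"author|auteur", re.I)),
    ("date", re.compile(r"date|time", re.I)),
    ("text", re.compile(r"content|text|^p\b", re.I)),
]


def classify(tag, cls="", id_=""):
    """Step 1: select a node and give it a category, or None if irrelevant."""
    descriptor = f"{tag} {cls} {id_}"
    for category, pattern in PATTERNS:
        if pattern.search(descriptor):
            return category
    return None


def group_fragments(categories):
    """Step 2 (assumed rule): a fragment ends once it holds text and a non-text node follows."""
    fragments, current = [], []
    for category in categories:
        if "text" in current and category != "text":
            fragments.append(current)
            current = []
        current.append(category)
    if current:
        fragments.append(current)
    return fragments


def group_lists(fragments):
    """Step 3: consecutive fragments with the same shape form one list."""
    return [list(run) for _, run in groupby(fragments)]


# Hypothetical selected nodes as (tag, class, id).
nodes = [
    ("h1", "title_comment", "title"),
    ("a", "com-author", ""),
    ("div", "com-content", "content-1"),
    ("div", "buzz", ""),  # matches no pattern, so it is dropped
    ("a", "com-author", ""),
    ("span", "com-date", ""),
    ("div", "com-content", "content-2"),
    ("a", "com-author", ""),
    ("span", "com-date", ""),
    ("div", "com-content", "content-3"),
]

categories = [c for c in (classify(*n) for n in nodes) if c is not None]
fragments = group_fragments(categories)
lists_of_fragments = group_lists(fragments)
print(fragments)
print(lists_of_fragments)
```

Here the article-like fragment ends up alone, while the two comment-like fragments, which share the same shape, are grouped into one list.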

Rethinking archive challenges using web fragments
> Crawler blindness can be reduced and archive quality increased
[Figure: download date versus edition date 1 and edition date 2]
Yabiladi's oldest fragments go back to 2003
> We introduce a more permissive archive consistency based on fragments and user requests
[Figure: page 1 and page 2 linked by an href, with stable fragments and a new fragment]

Rethinking archive challenges using web fragments
> Pages with archive-like content are no longer a problem when web fragments are the basic search unit: the same fragment shares the same id (sha256)
> Web fragments help us expand web archives beyond web pages
Now let's see how we can concretely conduct an exploratory archive analysis ...
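A minimal sketch of the shared-id idea, in Python. Which fields enter the hash and how they are normalized is not stated on the slide, so the choice below (author, edition date and whitespace-normalized text) is an assumption, as are the helper names and the page paths.

```python
import hashlib


def fragment_id(author, edition_date, content):
    """SHA-256 id of a fragment. Assumption: hash author, date and text,
    each with whitespace collapsed, joined by a separator character."""
    parts = (" ".join(part.split()) for part in (author, edition_date, content))
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def index_fragments(pages):
    """pages: {page path: [(author, edition_date, content), ...]}.
    Returns {fragment id: set of pages where the fragment appears}."""
    index = {}
    for path, fragments in pages.items():
        for fragment in fragments:
            index.setdefault(fragment_id(*fragment), set()).add(path)
    return index


# Hypothetical pages: a thread and an archive-like listing repeating one comment.
pages = {
    "forum/thread-1.html": [
        ("Kim", "2017-05-08 12:28", "blabla"),
        ("Lee", "2017-05-08 13:00", "hello"),
    ],
    "forum/archives-2017-05.html": [
        ("Kim", "2017-05-08 12:28", "  blabla "),
    ],
}

index = index_fragments(pages)
kim_id = fragment_id("Kim", "2017-05-08 12:28", "blabla")
print(len(index), sorted(index[kim_id]))
```

Because the repeated comment maps to a single id, a search over fragments counts it once, whichever page it was archived from.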

Exploratory analysis of Web archives
> Following John Wilder Tukey's work
An iterative process deliberately rooted in a logic of observation, discovery and astonishment
Acquire, Parse, Filter, Mine, Represent, Refine, Interpret

Archives extraction engine
Acquire, Parse, Filter, Mine, Represent, Refine, Interpret
> The Web Archives Explorer (part 1)
[Figure: components yabiladi.com, Crawler, .DAFF, ArchiveMiner, Fragments Extractor and External Resources; records made of meta (crawler date), data page (download date, page content) and data frag (edition date, author, frag content)]

Archives exploration engine
Acquire, Parse, Filter, Mine, Represent, Refine, Interpret
> The Web Archives Explorer (part 2)
[Figure: components Index of Events, Index of Pages & Fragments, ArchiveSearch (Full text, Facet, Ngrams) and ArchiveViz; records made of meta (crawler date), data page (download date, page content) and data frag (edition date, author, frag content)]

The validation of web fragments
Acquire, Parse, Filter, Mine, Represent, Refine, Interpret
> Using an event detection system:
1. threshold-based detection
2. identification with titles of news articles
3. field and expert interpretations
> Let's see the Web Archives Explorer in action (video presentation for CIKM2017)
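To give an idea of what threshold-based detection can look like, here is a small Python sketch. The threshold used (mean plus k standard deviations of the daily fragment counts) and the synthetic dates are assumptions for illustration; the slide does not say how the threshold is defined.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def detect_events(edition_dates, k=2.0):
    """Return the days whose number of fragments exceeds mean + k * std.
    Assumption: days without any fragment are ignored."""
    counts = Counter(edition_dates)
    values = list(counts.values())
    threshold = mean(values) + k * pstdev(values)
    return sorted(day for day, n in counts.items() if n > threshold)


# Synthetic data: 2 fragments per day for ten days, then a burst of 20.
edition_dates = [date(2012, 1, d) for d in range(1, 11) for _ in range(2)]
edition_dates += [date(2012, 1, 15)] * 20

events = detect_events(edition_dates)
print(events)
```

Counting by fragment edition date rather than by page download date is the point of the exercise: a burst then reflects activity on the site, not the crawler's schedule.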
