
Where the dead blogs are: A Disaggregated Exploration of Web archives

Where the dead blogs are: A Disaggregated Exploration of Web archives to Reveal Extinct Online Collectives. Quentin Lobbé (LTCI, Télécom ParisTech, Université Paris Saclay & Inria). 34ème Conférence sur la Gestion de Données (BDA2018)


  1. Where the dead blogs are: A Disaggregated Exploration of Web archives to Reveal Extinct Online Collectives. Quentin Lobbé (LTCI, Télécom ParisTech, Université Paris Saclay & Inria). 34ème Conférence sur la Gestion de Données (BDA2018). The 20th International Conference on Asia-Pacific Digital Libraries (ICADL2018)

  2. The online representations of diasporas > Migrants are the actors of a culture of bonds > mondeberbere.com, Morocco, 2002 > bok.net/pajol, France, 1996 > Personal laptop of a couple of Filipino workers in Paris, Diminescu, D. (2005) > By the mid-2000s, sociologists started to study the many digital traces left by diasporas. Diminescu, D. (2008), The connected migrant: an epistemological manifesto, Social Science Information, 47. Laflaquière, J. et al. (2005), Archiver le Web sur les migrations : quelles approches techniques et scientifiques ?, Migrance, 23.

  3. The e-Diasporas Atlas (1/2) > A multidisciplinary effort to discover and study online migrant collectives > A migrant Web site is a Web site created or managed by migrants and/or that deals with them > An e-Diaspora is a directed network of migrant Web sites linked by URLs (hypertext links) > An e-Diaspora is both online and offline > 10,000 migrant Web sites crawled, categorized and organized among 30 e-diasporas [Figure: three sites (site 1, site 2, site 3) connected by directed links] Diminescu, D. (2012), E-Diasporas Atlas: Exploration and Cartography of Diasporas on Digital Networks, Ed. de la Maison des sciences de l'homme, 2012, http://www.e-diasporas.fr/

  4. The e-Diasporas Atlas (2/2) > How to read and use the map? [Figure: map of the Moroccan e-Diaspora, by Dana Diminescu & Matthieu Renault; highlighted areas: (a) associations and NGOs, (b) institutional sites, (c) blogs; labelled sites: yabiladi.com, bladi.net]

  5. The question of extinct online collectives > A community for which too few or incomplete traces remain on the living Web [Figure: the Moroccan blogosphere (close-up and evolution), 2008 vs 2018; legend: degree, alive, deserted; labelled blogs: mlouizi.unblog.fr, larbi.org, lailalalami.com, 7didane.org]

  6. What happened to the dead Moroccan blogs? > We hypothesize that the structure of the blogosphere is permeable to the impact of exogenous events or shocks such as political or social mobilisations. > We will explore the e-Diasporas corpus of Web archives to find their remaining archived traces. > The e-Diasporas Atlas is also a corpus of Web archives: 1030 M Web pages; 70 TB; crawled weekly or monthly (2010-2014); crawls performed and hosted by the INA

  7. Archiving the Web? (1/2) > The preservation of our digital heritage > From the continuous Web to a discrete corpus of Web archives [Figure: pages $p_1, \dots, p_4$ of the continuous Web captured by successive crawls $c_1$, $c_2$, $c_3$ at download dates $t_1(p), \dots, t_4(p)$] > Web archive file formats: .DAFF (see also WARC)

  8. Archiving the Web? (2/2) > Exploration tools are designed for manual and focused analysis > Timeline: early 90s, invention of the Web; 1996, Archive.org; 2003, Unesco & Digital Heritage; 2011, French "dépôt légal du web" (legal deposit of the Web) > search by URL > aggregators > local access > full text WEB.TODAY > Why is it so hard to conduct an exploration of Web archives at scale?

  9. Web archives are not direct traces of the Web (1/2) > Web archives are direct traces of the crawler > "Boulevard du Temple", Louis Daguerre, 1838 > Web archives are built on top of Web pages and induce crawl legacy effects

  10. Web archives are not direct traces of the Web (2/2) > Going under the level of a Web page > Processing steps on the .DAFF archives: filter site, get forum, get posts > Figures shown: 156 Moroccan migrant Web sites; yabiladi.com; 2,683,928 archives; 109,534 threads; 422,906 posts [Figure: number of archived pages per year, 2004-2014, by download date and by edition date]

  11. In order to conduct a large-scale exploration of the Web that was: > We propose to introduce a new unit of exploration of Web archive corpora to avoid all kinds of crawl legacy effects and maximise the historical accuracy of our forthcoming exploration.

  12. The Web fragment (1/3) > Definition > Considering the Web page as the unit of access and consultation of the Web, built using its own writing modalities, and noticing that, from the point of view of human perception, a Web page is the result of a logical arrangement of distinct semantic components, we define the Web fragment as a semantic and syntactic subset of a given Web page. [Figure: a page $p_1$ divided into fragments $f_{11}$, $f_{12}$, $f_{13}$] Bernard, M. (2003), Criteria for optimal web design (designing for usability). Michailidou, E. et al. (2008), Visual Complexity and Aesthetic Perception of Web Pages (SIGDOC '08).

  13. The Web fragment (2/3) > Definition > A Web fragment $f_{jk}$ is a coherent and self-sufficient set of textual, visual or audio content > There is a scale relationship between a Web page and its fragments: a fragment can range from pure metadata to the full Web page > Within the same Web page, two Web fragments cannot overlap: $f_{11} \cap f_{12} = \emptyset$

  14. The Web fragment (3/3) > Definition > A Web fragment $f_{jk}$ goes with an associated set of categorised information: is there a title? An author name? An edition date $\varphi(f_{jk})$? > It encompasses the writing and sharing elements used for publishing and sharing its content: are there CMS widgets? href links? An RSS feed?

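  15. Upscaling the exploration (1/3) > Crawl blindness > $\forall p_j, f_{jk},\ \exists \varphi(f_{jk}) : \varphi(f_{jk}) \le t_i(p_j)$, where $t_i(p_j)$ is a download date of page $p_j$ and $\varphi(f_{jk})$ is the edition date of its fragment $f_{jk}$ [Figure: a page $p_j$ with download date $t_i(p_j)$ and two fragments with edition dates $\varphi(f_{j1})$ and $\varphi(f_{j2})$] > For yabiladi.com, the quartiles of $t_i(p_j) - \varphi(f_{jk})$, in days, are: (Q1) 256, (Q2) 777, (Q3) 1340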
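
The gap between a download date and an edition date can be measured with a few lines of Python. This is only a minimal sketch: the function name `blindness_gaps`, the record layout and the sample dates are assumptions made for illustration, and the quartile method used on the slide is not stated, so the "inclusive" method below is one possible choice.

```python
from datetime import date
from statistics import quantiles


def blindness_gaps(records):
    """Return (download date - edition date) in days for each pair.

    `records` is a hypothetical list of (download_date, edition_date)
    tuples, one per archived fragment.
    """
    gaps = []
    for download_date, edition_date in records:
        gap = (download_date - edition_date).days
        if gap < 0:
            # A fragment cannot be edited after the page was downloaded.
            raise ValueError("edition date is later than download date")
        gaps.append(gap)
    return gaps


# Illustrative dates only, not taken from the yabiladi.com archives.
records = [
    (date(2012, 1, 10), date(2011, 1, 10)),
    (date(2012, 1, 10), date(2010, 1, 10)),
    (date(2013, 6, 1), date(2013, 5, 2)),
    (date(2014, 3, 1), date(2009, 3, 1)),
    (date(2014, 3, 1), date(2014, 3, 1)),
]

gaps = blindness_gaps(records)
q1, q2, q3 = quantiles(gaps, n=4, method="inclusive")
print(f"Q1={q1} Q2={q2} Q3={q3} (days)")
```

A large median gap means that most fragments were written long before the crawler first saw them, which is why the download date alone is a poor proxy for the date of publication.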

  16. Upscaling the exploration (2/3) > Disaggregated observable coherence > We define a discrete subset of fragments of interest: $\forall p_j,\ \forall f^*_{jk} \in \{f^*_{j1}, \dots, f^*_{jm}\},\ \exists t_{coherence} : t_{coherence} \in \bigcap_k [\varphi(f^*_{jk}), t_i(p_j)] \neq \emptyset$ [Figure: coherence interval $t_{coherence}$ between pages $p_1$, $p_2$, compared with the coherence interval $t_{coherence}$ using fragments $f^*_1$, $f^*_2$] > And introduce a more permissive coherence model based on a specific research question. Spaniol, M. et al. (2009), Data quality in Web archiving (WICOW '09)
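
The coherence condition amounts to intersecting time intervals that each run from a fragment's edition date to a download date. A minimal sketch follows, assuming each fragment of interest is given as an (edition date, download date) pair; the function name `coherence_interval` and the sample dates are hypothetical.

```python
from datetime import date


def coherence_interval(intervals):
    """Intersect closed intervals [edition_date, download_date].

    Returns the common (start, end) interval, or None when the
    intervals do not all overlap.
    """
    if not intervals:
        raise ValueError("at least one interval is required")
    start = max(lower for lower, _ in intervals)
    end = min(upper for _, upper in intervals)
    return (start, end) if start <= end else None


# Two fragments of the same page, downloaded on 2012-05-01.
same_page = [
    (date(2011, 3, 1), date(2012, 5, 1)),
    (date(2012, 1, 15), date(2012, 5, 1)),
]
print(coherence_interval(same_page))

# Two intervals that never overlap: no coherent observation time.
disjoint = [
    (date(2010, 1, 1), date(2010, 6, 1)),
    (date(2011, 1, 1), date(2011, 6, 1)),
]
print(coherence_interval(disjoint))
```

When all fragments belong to the same download, the upper bound is shared and the intersection starts at the latest edition date.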

  17. Upscaling the exploration (3/3) > Duplicated archived contents > The same fragment $f_{11}$ of page $p_1$ can be archived unchanged at successive download dates $t_1(p_1)$, $t_2(p_1)$, ..., $t_i(p_1)$: $id(c_1(f_{11})) = id(c_2(f_{11}))$ [Figure: page $p_1$ and its fragment $f_{11}$ at two download dates] > In practice, we deduplicate with a SHA-256 id computed on each Web fragment > For yabiladi.com, the quartiles of the number of duplicated fragments are: (Q1) 1, (Q2) 1, (Q3) 2, (Max) 44
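
Deduplicating by a SHA-256 identifier could look like the sketch below. The helper names, the record layout and the whitespace normalisation step are assumptions added for illustration; the slide only states that a SHA-256 id is computed on each fragment.

```python
import hashlib


def fragment_id(content: str) -> str:
    """Return a SHA-256 hex digest used as the fragment identifier.

    Collapsing whitespace first is an assumption of this sketch, so
    that trivial formatting changes do not produce a new id.
    """
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(captures):
    """Keep the first capture of each distinct fragment.

    `captures` is a hypothetical iterable of (download_date, content)
    pairs, ordered by download date.
    """
    unique = {}
    for download_date, content in captures:
        fid = fragment_id(content)
        if fid in unique:
            unique[fid]["copies"] += 1
        else:
            unique[fid] = {
                "first_seen": download_date,
                "content": content,
                "copies": 1,
            }
    return unique


captures = [
    ("2010-03-01", "Hello from the forum"),
    ("2010-04-01", "Hello  from the forum "),  # same text, extra spaces
    ("2010-04-01", "A new reply"),
]
unique = deduplicate(captures)
print(len(unique), sorted(entry["copies"] for entry in unique.values()))
```

The `copies` counter gives the number of times each fragment was archived, which is the quantity summarised by the quartiles on the slide.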

  18. Finding Web fragments > Technical fragmentation and information extraction > (1) DOM tree of page $p_j = \{n_1, \dots, n_4\}$ (<node 1>, <node 2>, <node 3>, <node 4>) > (2) Clustering closest HTML nodes using Readability and Fathom. The distance function relies on vision/tag-based penalties and ad-hoc rules. It can be set up by the researcher. Example: $f_{j1} = n_2 \cup n_4$, $f_{j2} = n_1 \cup n_3$ > (3) Information extraction for each fragment: title? author? date? (yes/no) D. Cai et al. (2003), VIPS: a vision-based page segmentation algorithm. A. Jatowt et al. (2007), Detecting age of page content. C. Kohlschütter et al. (2010), Boilerplate detection using shallow text features (WSDM '10)
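
To make the idea of grouping close HTML nodes concrete, here is a toy sketch. It does not use Readability or Fathom and is not the distance function of the talk: the `Node` fields, the penalty values and the threshold are all assumptions chosen so that the example reproduces the grouping $f_{j1} = n_2 \cup n_4$, $f_{j2} = n_1 \cup n_3$.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Node:
    name: str
    tag: str
    depth: int
    css_class: str


def distance(a, b, tag_penalty=2.0, class_penalty=1.0):
    """Toy distance: depth difference plus tag and class penalties."""
    d = abs(a.depth - b.depth)
    if a.tag != b.tag:
        d += tag_penalty
    if a.css_class != b.css_class:
        d += class_penalty
    return d


def fragment(nodes, threshold=1.0):
    """Group nodes whose distance is within `threshold` (single linkage)."""
    parent = {n.name: n.name for n in nodes}

    def find(x):
        # Follow parent links up to the representative of the group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(nodes, 2):
        if distance(a, b) <= threshold:
            parent[find(a.name)] = find(b.name)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n.name), []).append(n.name)
    return sorted(sorted(g) for g in groups.values())


page = [
    Node("n1", "div", 2, "comment"),
    Node("n2", "p", 3, "post"),
    Node("n3", "div", 2, "comment"),
    Node("n4", "p", 3, "post"),
]
print(fragment(page))
```

Changing the penalties or the threshold changes the granularity of the fragments, which is the knob a researcher would adjust.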

  19. Building an exploration engine > From archive files to search and visualisation facilities [Figure: architecture of the engine. Components shown: configurations & external data, HDFS, .DAFF metadata and .DAFF data, Spark, Solr (schema, index), Node.js handler, user visualisation. Processing steps shown: filter by site, filter by date, group by id's, join by id's, fragmentation, indexation] Lobbé, Q. (2018), Revealing historical events out of Web archives, TPDL 2018

  20. The archived traces of digital mutation (1/3) > Finding fragments mentioning social networks, e.g. <span class="Twitter"></span>, Facebook [Figure: the Moroccan blogosphere, 2008 vs 2018; legend: degree, followers, alive, deserted, social networks; labelled blogs: larbi.org, lailalalami.com, 7didane.org] > Authors kept their pseudonyms (or a close variation) from blogs to social platforms

  21. The archived traces of digital mutation (2/3) > Moving into new Web territories [Figure: network linking blogs and sites to social platforms. Blogs and sites: labelash.blogspot.com, eatbees.com/blog, anasalaoui.com, sonofwords.blogspot.com, sahara-libre.blogspot.com, saad.amrani.free.fr/blog, kingstoune.com, cabalamuse.wordpress.com, blogreda.blogspot.com, 9afia.blogspot.com, myrtus.typepad.com, larbi.org, sebti.fr, magiaenmarruecos.blogspot.com, lailalalami.com, 7didane.org, oef75.blogspot.com, mlouizi.unblog.fr, lesamismarocains.blogspot.com, lallamenana.free.fr. Platforms: Youtube, Medium, Pinterest, Twitter, Flickr, Facebook, Mediapart] > The expression is fragmented and specialized by type of medium > Graph density went from 0.16 in 2008 to 0.24 in 2018 (blogs vs Twitter)
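
For readers unfamiliar with graph density, the sketch below computes it under the usual definition for a directed graph without self-loops, $m / (n(n-1))$ for $n$ nodes and $m$ links. Whether the figures on the slide were computed with exactly this definition is an assumption, and the node names and links are made up for illustration.

```python
def directed_density(nodes, edges):
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    n = len(set(nodes))
    if n < 2:
        return 0.0
    # Ignore self-loops and repeated links between the same pair.
    unique_edges = {(src, dst) for src, dst in edges if src != dst}
    return len(unique_edges) / (n * (n - 1))


# Hypothetical network of five sites and four hypertext links.
nodes = ["site_a", "site_b", "site_c", "site_d", "site_e"]
edges = [
    ("site_a", "site_b"),
    ("site_b", "site_a"),
    ("site_c", "site_a"),
    ("site_d", "site_e"),
]
density = directed_density(nodes, edges)
print(density)
```

A density closer to 1 means that a larger share of all possible links between accounts actually exists.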
