Accessing Web Archives from Different Perspectives with Potential Synergies
RESAW/IIPC, London 2017
Helge Holzmann, Thomas Risse 06/15/2017
Helge Holzmann (holzmann@L3S.de)
[Figure: architecture overview. Sources: Web, Social Networks & Streams, Linked Open Data Cloud. Components: Web Archive & Index, Entity Resolution & Evolution, Time-Aware Entity Graph (t1 ... t4, tnow), Aggregation & Time-Aware Indexing, Entity Linking, Improvement, Enrichment, Time- and Entity-Based Retrieval (complex query), Collaborative Exploration & Analytics]
http://blog.archive.org/2016/09/19/the-internet-archive-turns-20/
http://web.archive.org/web/20121107020708/http://www.nytimes.com/
http://web.archive.org/web/TIMESTAMP/URL
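The TIMESTAMP/URL pattern above can be sketched in a few lines of Python. This is a minimal illustration, assuming the timestamp is the 14-digit YYYYMMDDhhmmss form seen in the example URL; the function names `wayback_url` and `parse_wayback_url` are hypothetical helpers, not part of any Wayback Machine library.

```python
from datetime import datetime
from typing import Tuple

# Prefix copied from the example URL on the slide.
WAYBACK_PREFIX = "http://web.archive.org/web"
# Assumed timestamp layout: YYYYMMDDhhmmss (14 digits), as in the example.
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def wayback_url(url: str, when: datetime) -> str:
    """Fill the TIMESTAMP/URL pattern for one capture time."""
    return f"{WAYBACK_PREFIX}/{when.strftime(TIMESTAMP_FORMAT)}/{url}"


def parse_wayback_url(archived: str) -> Tuple[datetime, str]:
    """Split an archived URL back into its capture time and original URL."""
    if not archived.startswith(WAYBACK_PREFIX + "/"):
        raise ValueError("not a web.archive.org/web URL")
    rest = archived[len(WAYBACK_PREFIX) + 1:]
    timestamp, original = rest.split("/", 1)
    return datetime.strptime(timestamp, TIMESTAMP_FORMAT), original


if __name__ == "__main__":
    example = wayback_url("http://www.nytimes.com/", datetime(2012, 11, 7, 2, 7, 8))
    print(example)
    print(parse_wayback_url(example))
```

Keeping the original URL as the last path component means no separate lookup table is needed: the archived address alone identifies both the page and the point in time.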
Artifacts provided for highly referenced articles:
~60% link to some sort of documentation
~30% provide source code
http://tempas.L3S.de/...?software=866&publication=01415032
[Helge Holzmann, Avishek Anand - “Tempas: Temporal Archive Search Based on Tags”. WWW 2016]
[Helge Holzmann, Wolfgang Nejdl, Avishek Anand - “On the Applicability of Delicious for Temporal Search on Web Archives”. SIGIR 2016]
[Helge Holzmann, Wolfgang Nejdl, Avishek Anand - “Exploring Web Archives Through Temporal Anchor Texts”. WebSci 2017 (to appear)]
anchor text a, based on freq(v,a):
[Figure: chart with series Universities, Game websites, Total registered; axis tick 1999]
doubling their volume every two years
will be greater than today’s
[Helge Holzmann, Wolfgang Nejdl, Avishek Anand - “The Dawn of Today's Popular Domains: A Study of the Archived German Web over 18 Years”. JCDL 2016]
→ 2020: ~6 times the number of URLs per domain as in 2014
Source: Yahoo! MapReduce Lecture
ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In Proceedings of JCDL, Newark, New Jersey, USA, 2016.
[Helge Holzmann, Vinay Goel and Avishek Anand - “ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation”. JCDL 2016]
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
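The three selections a) to c) only need capture metadata (URL, timestamp, MIME type, HTTP status), so they can be sketched as plain filters over such records. The Python below is an illustrative sketch and not the ArchiveSpark API (ArchiveSpark itself runs on Spark); the `Capture` record, its field names, and the helper functions are assumptions made for this example. Selection c) is interpreted here as "latest successful capture per URL", which is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Capture:
    """Hypothetical metadata record for one archived capture."""
    url: str
    timestamp: str  # assumed 14-digit YYYYMMDDhhmmss string
    mime: str
    status: int


def select_url(captures: Iterable[Capture], url: str) -> List[Capture]:
    """a) All captures of one particular URL."""
    return [c for c in captures if c.url == url]


def select_html_under_domain(captures: Iterable[Capture], domain: str) -> List[Capture]:
    """b) All text/html captures whose host is the domain or one of its subdomains."""
    def in_domain(c: Capture) -> bool:
        host = urlsplit(c.url).hostname or ""
        return host == domain or host.endswith("." + domain)
    return [c for c in captures if c.mime == "text/html" and in_domain(c)]


def latest_successful_in_month(captures: Iterable[Capture], year_month: str) -> Dict[str, Capture]:
    """c) Per URL, the latest capture with status 200 in the month 'YYYYMM'."""
    latest: Dict[str, Capture] = {}
    for c in captures:
        if c.status == 200 and c.timestamp.startswith(year_month):
            best = latest.get(c.url)
            # Fixed-width timestamps compare correctly as strings.
            if best is None or c.timestamp > best.timestamp:
                latest[c.url] = c
    return latest


# Made-up sample records for illustration only.
SAMPLE = [
    Capture("http://www.nytimes.com/", "20121107020708", "text/html", 200),
    Capture("http://www.nytimes.com/", "20121120101500", "text/html", 200),
    Capture("http://www.nytimes.com/", "20121125000000", "text/html", 503),
    Capture("http://www.nytimes.com/logo.png", "20121108000000", "image/png", 200),
    Capture("http://example.org/", "20121201000000", "text/html", 200),
]

if __name__ == "__main__":
    print(len(select_url(SAMPLE, "http://www.nytimes.com/")))
    print(len(select_html_under_domain(SAMPLE, "nytimes.com")))
    print(latest_successful_in_month(SAMPLE, "201211"))
```

Because every filter touches only the small metadata records, the archived payloads would have to be read only for the captures that survive the selection.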
source: http://www.okclipart.com/blue-fish-clipart90plakdqyz/
1. Identify time / keywords of interest
2. Find entry points for the study
3. Locate suitable documents in the archive
4. Detect and extract desired information
5. Aggregate statistics and present results
[Helge Holzmann, Wolfgang Nejdl, Avishek Anand - “Exploring Web Archives Through Temporal Anchor Texts”. WebSci 2017 (to appear)]
www.HelgeHolzmann.de