Web Data Engineering:
A Technical Perspective on Web Archives
- Dr. Helge Holzmann
Web Data Engineer, Internet Archive, helge@archive.org
Open Repositories 2019
Hamburg, Germany June 12, 2019
Helge Holzmann (helge@archive.org) 2019-06-12
http://blog.archive.org/2016/09/19/the-internet-archive-turns-20
[Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis 2019]
*.cdx.lang_2017-18_v2.cdxa.gz
# Language detection using 'square leaf' approach
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12

CDX (Capture Index) with pointers to corresponding (W)ARC records: *.cdx
com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX
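The capture index and the language attachment shown here share the record digest, so the two can be joined on it. The following is a minimal Scala sketch of that join, not the Internet Archive's implementation. It assumes a simplified six-column CDX line (SURT key, timestamp, URL, MIME type, status, digest) and attachment lines of the form "digest lang:score,...". The names CdxLangJoin, CdxRecord, parseCdx, parseLangs and join are hypothetical, and the sample data is made up.

```scala
object CdxLangJoin {
  // Simplified capture record. Assumption: six whitespace-separated columns;
  // real CDX files usually carry more (e.g. offset and file name).
  final case class CdxRecord(surt: String, timestamp: String, url: String,
                             mime: String, status: Int, digest: String)

  // Parse one CDX line of the assumed six-column layout; None if malformed.
  def parseCdx(line: String): Option[CdxRecord] =
    line.trim.split("\\s+") match {
      case Array(surt, ts, url, mime, status, digest)
          if status.nonEmpty && status.forall(_.isDigit) =>
        Some(CdxRecord(surt, ts, url, mime, status.toInt, digest))
      case _ => None
    }

  // Parse an attachment line such as "DIGEST fr:54,en:7" into
  // digest -> (language -> score). A bare digest maps to an empty Map.
  def parseLangs(line: String): Option[(String, Map[String, Int])] =
    line.trim.split("\\s+") match {
      case Array(digest) if digest.nonEmpty =>
        Some(digest -> Map.empty[String, Int])
      case Array(digest, langs) =>
        val scores = langs.split(",").toList.map(_.split(":")).collect {
          case Array(lang, score) if score.nonEmpty && score.forall(_.isDigit) =>
            lang -> score.toInt
        }.toMap
        Some(digest -> scores)
      case _ => None
    }

  // Attach the detected languages to every capture via the shared digest.
  def join(cdx: Seq[CdxRecord],
           langs: Map[String, Map[String, Int]]): Seq[(CdxRecord, Map[String, Int])] =
    cdx.map(r => r -> langs.getOrElse(r.digest, Map.empty[String, Int]))

  def main(args: Array[String]): Unit = {
    // Made-up sample data, not taken from a real index.
    val cdxLines = Seq(
      "org,example,fr)/ 20060625153331 http://fr.example.org/ text/html 200 DIGESTFR",
      "org,example,hk)/ 20150819101628 https://hk.example.org/ text/html 200 DIGESTHK")
    val langLines = Seq("DIGESTFR fr:54,en:7", "DIGESTHK")

    val records = cdxLines.flatMap(line => parseCdx(line))
    val langs = langLines.flatMap(line => parseLangs(line)).toMap
    join(records, langs).foreach { case (r, l) => println(s"${r.url} -> $l") }
  }
}
```

Keeping derived data in a separate file keyed by digest leaves the original index untouched, and a capture without an attachment entry simply gets an empty language map in this sketch.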
# The last available capture
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5
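One plausible way to derive such a "last available capture" entry is to group captures by their SURT key and keep the one with the greatest timestamp. The Scala sketch below illustrates that idea only; how the attachment was actually produced is not stated in the slides. The names LatestCapture, Capture and latestPerKey are hypothetical. The keys, timestamps and status codes in the sample are the ones visible in the listings.

```scala
object LatestCapture {
  // Minimal capture record: SURT key, 14-digit timestamp, HTTP status.
  final case class Capture(surt: String, timestamp: String, status: Int)

  // yyyyMMddHHmmss timestamps have a fixed width, so plain string order
  // is also chronological order.
  def latestPerKey(captures: Seq[Capture]): Map[String, Capture] =
    captures.groupBy(_.surt).map { case (key, group) =>
      key -> group.maxBy(_.timestamp)
    }

  def main(args: Array[String]): Unit = {
    val captures = Seq(
      Capture("com,yahoo,answers,es)/", "20060616001149", 200),
      Capture("com,yahoo,answers,es)/", "20180904025943", 200),
      Capture("com,yahoo,answers,in)/", "20060422210325", 200),
      Capture("com,yahoo,answers,in)/", "20180906005337", 301))

    latestPerKey(captures).toSeq.sortBy(_._1).foreach { case (key, c) =>
      println(s"$key ${c.timestamp} ${c.status}")
    }
  }
}
```

Note that the latest capture is not necessarily a successful one: in the sample, the newest capture under com,yahoo,answers,in)/ is a 301 redirect.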
(beta)
editable system
[Helge Holzmann, Vinay Goel and Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016]
[Helge Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017]
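// Load web archive records from a CDX index and the corresponding (W)ARC files
val rdd = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))
// Keep only successful HTML captures
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
// Enrich each record with the entities extracted from it
val entities = onlineHtml.enrich(Entities)
// Save the enriched records as JSON
entities.saveAsJson("entities.gz")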
title text entities persons
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
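The three selections a) to c) can be expressed as simple filters over CDX-style metadata, without touching the archived payloads. The Scala sketch below works on an in-memory list and is only an illustration. The names CdxSelect, Capture, byUrl, htmlUnderDomain and latestSuccessInMonth are hypothetical, the domain is assumed to be given in SURT form (for example "org,example"), and the sample records are made up.

```scala
object CdxSelect {
  final case class Capture(surt: String, timestamp: String, url: String,
                           mime: String, status: Int)

  // a) All captures of one particular URL.
  def byUrl(cs: Seq[Capture], url: String): Seq[Capture] =
    cs.filter(_.url == url)

  // b) All HTML pages under a domain given in SURT form (e.g. "org,example").
  // The host part of a SURT key ends at ')'; subdomains follow after a comma.
  def htmlUnderDomain(cs: Seq[Capture], surtDomain: String): Seq[Capture] =
    cs.filter { c =>
      val host = c.surt.takeWhile(_ != ')')
      c.mime == "text/html" &&
        (host == surtDomain || host.startsWith(surtDomain + ","))
    }

  // c) Latest successful capture (HTTP status 200) in a month given as "yyyyMM".
  def latestSuccessInMonth(cs: Seq[Capture], yearMonth: String): Option[Capture] =
    cs.filter(c => c.status == 200 && c.timestamp.startsWith(yearMonth))
      .sortBy(_.timestamp)
      .lastOption

  // Made-up sample records for illustration.
  val sample: Seq[Capture] = Seq(
    Capture("org,example)/", "20180901120000", "http://example.org/", "text/html", 200),
    Capture("org,example)/", "20180915120000", "http://example.org/", "text/html", 200),
    Capture("org,example)/", "20180920120000", "http://example.org/", "text/html", 301),
    Capture("org,example,img)/logo.png", "20180902120000",
            "http://img.example.org/logo.png", "image/png", 200),
    Capture("org,example2)/", "20180903120000", "http://example2.org/", "text/html", 200))

  def main(args: Array[String]): Unit = {
    println(byUrl(sample, "http://example.org/").size)
    println(htmlUnderDomain(sample, "org,example").map(_.url).distinct)
    println(latestSuccessInMonth(byUrl(sample, "http://example.org/"), "201809"))
  }
}
```

The selections compose: applying c) to the result of a) yields the latest successful capture of a single URL in the chosen month. Matching on the SURT host rather than on a raw URL prefix keeps "org,example2" from being mistaken for a page under "org,example".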
www.HelgeHolzmann.de
If interested in our work, please get in touch!