Web Data Engineering: A Technical Perspective on Web Archives

Web Data Engineering: A Technical Perspective on Web Archives. Dr. Helge Holzmann, Web Data Engineer, Internet Archive (helge@archive.org). Open Repositories 2019, Hamburg, Germany, June 12, 2019. 2019-06-12 Helge Holzmann


  1. Web Data Engineering: A Technical Perspective on Web Archives
     Dr. Helge Holzmann, Web Data Engineer, Internet Archive (helge@archive.org)
     Open Repositories 2019, Hamburg, Germany, June 12, 2019

  2. What is a web archive?
     • Web archives preserve our history as documented on the web…
       • … in huge datasets, consisting of all kinds of web resources
         • e.g., HTML pages, images, video, scripts, …
       • … stored as big files in the standardized (W)ARC format
         • along with metadata + request / response headers
         • next to lightweight capture index files (CDX)
       • … to provide access to webpages from the past
         • for users through close reading
           • replayed by the Wayback Machine
         • for data analysis at scale through distant reading
           • enabled by Big Data processing methods, like Hadoop / Spark, …
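To make the "metadata + request / response headers" point concrete, here is a minimal sketch of splitting one uncompressed WARC record into its header fields and payload. The sample record is made up and simplified for illustration (real records carry more mandatory fields and are usually stored gzip-compressed), and `parse_warc_record` is a hypothetical helper, not part of any Internet Archive tooling.

```python
def parse_warc_record(raw: bytes):
    """Split a single uncompressed WARC record into (version, headers, payload)."""
    # The WARC header block ends at the first empty line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split at the first colon only
        headers[name.strip()] = value.strip()
    # The record block is exactly Content-Length bytes long.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Made-up sample: a "response" record wrapping a tiny HTTP response.
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
warc_headers = [
    b"WARC/1.0",
    b"WARC-Type: response",
    b"WARC-Target-URI: http://example.org/",
    b"WARC-Date: 2019-06-12T00:00:00Z",
    b"Content-Length: " + str(len(http_block)).encode("ascii"),
]
sample = b"\r\n".join(warc_headers) + b"\r\n\r\n" + http_block + b"\r\n\r\n"

version, headers, payload = parse_warc_record(sample)
print(version, headers["WARC-Type"], len(payload))
```

For real data, a dedicated WARC library is the safer choice; the sketch only shows why a lightweight index next to the big files is useful: locating and parsing records is low-level work.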

  5. Not today's topic …
     http://blog.archive.org/2016/09/19/the-internet-archive-turns-20

  6. The (archived) web…
     • … is a very valuable dataset to study the web (and the offline world)
       • Access to very diverse knowledge from various disciplines (history, politics, …)
       • The whole web at your fingertips / processable snapshots
       • Adds a temporal dimension to the web / captures dynamics
     • … is a largely unstructured collection of data
       • Access and analysis at scale is challenging
       • Processing petabytes of data is expensive and time-consuming
       • Difficult to discover, identify, extract records and contained information
       • Potentially highly technical, complex access and parsing process
       • Low-level details users / researchers / data scientists don't want to / can't deal with
     • Data engineering is needed before the data can be used in downstream applications / studies

  7. Different perspectives on web archives
     • User-centric view
       • (Temporal) search / information retrieval
       • Direct access / replaying archived pages
     • Data-centric view
       • (W)ARC and CDX (metadata) datasets
       • Big data processing: Hadoop, Spark, …
       • Content analysis, historical / evolution studies
     • Graph-centric view
       • Structural view on the dataset
       • Graph algorithms / analysis, structured information
       • Hyperlink and host graphs, entity / social networks, facts and more
     [Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis, 2019]

  8. Web (archives) as graph
     • Foundational model for most downstream applications / analysis tasks
       • E.g., search index construction, term / entity co-occurrence studies, …
     • Different ways / approaches to construct / extract (temporal) graphs
       • (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc.
     • Technical challenges that users don't want to / can't deal with:
       • Efficient generation, effective representation, …
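The "hosts vs. URLs" distinction can be sketched as follows: URL-level hyperlinks, each tagged with a capture timestamp, are aggregated into weighted host-level edges per year. This is only an illustration under assumptions: the input tuples, the example hosts, and the function name `temporal_host_graph` are hypothetical, and a real pipeline would extract the links from archived HTML at scale.

```python
from collections import Counter
from urllib.parse import urlsplit


def temporal_host_graph(links):
    """Aggregate (timestamp, source_url, target_url) into (year, src_host, dst_host) counts."""
    edges = Counter()
    for timestamp, src_url, dst_url in links:
        src = urlsplit(src_url).hostname
        dst = urlsplit(dst_url).hostname
        # Skip unparsable URLs and links that stay within one host.
        if src and dst and src != dst:
            edges[(timestamp[:4], src, dst)] += 1
    return edges


# Hypothetical URL-level links with 14-digit capture timestamps.
links = [
    ("20060616001149", "http://a.example/page1", "http://b.example/x"),
    ("20060617034947", "http://a.example/page2", "http://b.example/y"),
    ("20060618041859", "http://a.example/page1", "http://a.example/page2"),
    ("20150819101628", "http://a.example/page1", "http://b.example/x"),
]

graph = temporal_host_graph(links)
for edge, weight in sorted(graph.items()):
    print(edge, weight)
```

The host graph is much smaller than the URL graph, which is one reason it is often the more practical representation for downstream analysis.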

  9. (Temporal) search in web archives
     • Wanted: enter a textual query, find relevant captures
     • Challenges:
       • Documents are temporal / consist of multiple versions
         • New captures could be near-duplicates or relevant changes
       • Temporal relevance in addition to textual relevance
         • Relevance to the query is not always encoded in the content
       • Information needs / query intents are different from traditional IR
         • Mostly navigational: Under which URL can I find a specific resource?
     • How to turn (temporal) graphs into a searchable index?
       • Integrate full-text, titles, headlines, anchor texts, ...?
       • Convert into a format supported by Information Retrieval systems, e.g. Elasticsearch
       • Adaptation of existing retrieval models
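One simple way to handle "multiple versions" is to collapse consecutive captures of a URL that share the same content digest into a single version with a validity interval. The sketch below does exactly that and nothing more: it detects exact duplicates only (near-duplicates would need a similarity measure), and the digests and function name are made up for illustration.

```python
from itertools import groupby


def collapse_versions(captures):
    """Collapse runs of identical digests into (first_seen, last_seen, digest).

    `captures` must be (timestamp, digest) pairs for one URL, sorted by timestamp.
    """
    versions = []
    for digest, run in groupby(captures, key=lambda capture: capture[1]):
        run = list(run)
        versions.append((run[0][0], run[-1][0], digest))
    return versions


# Made-up captures of a single URL; "AAA" / "BBB" stand in for content digests.
captures = [
    ("20060101000000", "AAA"),
    ("20060201000000", "AAA"),
    ("20060301000000", "BBB"),
    ("20060401000000", "AAA"),
]

versions = collapse_versions(captures)
for first_seen, last_seen, digest in versions:
    print(first_seen, last_seen, digest)
```

Each resulting version could then become one document in a search index, with the interval serving as a temporal relevance signal.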

  10. Web Data Engineering
      • Transforming data into useful information
      • Making it usable for downstream applications
        • Search, data science, digital humanities, content analysis, ...
        • Regular users, researchers, data scientists / analysts, ...
      • Enabling efficient and effective access through...
        • ... infrastructures
        • ... suitable data formats
        • ... simple tools / APIs
        • ... optimized indexes
      • Technical considerations made by computer scientists
        • to help users / researchers focus on their application / study / research
        • to hide complexity / low-level details through flexible abstractions

  11. Example: Language Analysis (1)
      • Possible research questions:
        • Which pages of a language exist outside the country's ccTLD?
        • Which languages are used the most in a certain area / topic?
        • How has a language evolved over time on the web?
      • Requirements:
        • Tools for (W)ARC access, HTML parsing, language detection
        • Language-annotated pages / captures
      • Challenges:
        • Texts too short to detect a language / confidence scores
        • Multiple languages on one page / filtering and weighting
        • Slow and expensive processing due to large-scale content analysis (weeks)
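The first research question can be sketched once language-annotated captures exist. The example below assumes records of the form (SURT key, detected language) and a small language-to-ccTLD mapping; both the mapping and the `es,example)/` key are illustrative assumptions, not data from the talk.

```python
# Illustrative mapping from language code to the country's ccTLD (an assumption).
LANG_TO_CCTLD = {"es": "es", "fr": "fr", "it": "it"}


def surt_tld(surt_key):
    """Return the TLD of a SURT key, e.g. 'com,yahoo,answers,es)/' -> 'com'."""
    host_part = surt_key.split(")", 1)[0]  # reversed host labels before ')'
    return host_part.split(",", 1)[0]      # first label is the TLD


def outside_cctld(records, lang):
    """SURT keys of pages detected as `lang` that are not under its ccTLD."""
    cctld = LANG_TO_CCTLD[lang]
    return [surt for surt, detected in records
            if detected == lang and surt_tld(surt) != cctld]


records = [
    ("com,yahoo,answers,es)/", "es"),
    ("es,example)/", "es"),            # hypothetical page under .es
    ("com,yahoo,answers,fr)/", "fr"),
]

spanish_outside = outside_cctld(records, "es")
print(spanish_outside)
```

The expensive part is producing the language annotations in the first place; with them in hand, the question itself reduces to a cheap filter.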

  12. Example: Language Analysis (2)
      • Wanted:
        • Efficient access to comprehensive results
        • Lightweight, reusable exchange format
        • Dynamic threshold / flexible post-filtering
      • Solution: (CDX) Attachment Format (ATT / CDXA)
        • Lightweight, efficient loading, integrated data validation, decoupled from data
      CDX (Capture Index) with pointers to corresponding (W)ARC records (*.cdx):
          com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
          com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
          com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
          com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
          com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
          com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
          com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX
      Attachment (*.cdx.lang_2017-18_v2.cdxa.gz):
          # Language detection using 'square leaf' approach
          Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
          RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
          3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
          5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
          XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
          7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
          45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12
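The "dynamic threshold / flexible post-filtering" idea can be sketched by loading attachment lines keyed by digest and joining them with CDX rows at query time. The line layout (`DIGEST lang:score,lang:score`, `#` for comments) is read off the sample on the slide and is an assumption about the format, not its specification; the function names are hypothetical. The sample rows pair SURT keys and timestamps from the slide with the full digests whose prefixes match.

```python
def parse_attachment(line):
    """Parse 'DIGEST es:82,en:7' into (digest, {'es': 82, 'en': 7})."""
    parts = line.split(None, 1)
    scores = {}
    if len(parts) > 1:  # a digest may carry no language value at all
        for item in parts[1].split(","):
            lang, _, score = item.partition(":")
            scores[lang.strip()] = int(score)
    return parts[0], scores


def load_attachments(lines):
    """Build a digest -> scores lookup, skipping comments and blank lines."""
    return dict(parse_attachment(line) for line in lines
                if line.strip() and not line.startswith("#"))


def languages(cdx_rows, attachments, threshold):
    """Yield (surt, timestamp, languages with score >= threshold) per CDX row."""
    for surt, timestamp, digest in cdx_rows:
        scores = attachments.get(digest, {})
        yield surt, timestamp, sorted(l for l, s in scores.items() if s >= threshold)


cdxa_lines = [
    "# Language detection using 'square leaf' approach",
    "Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82",
    "3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7",
    "45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12",
]
cdx_rows = [
    ("com,yahoo,answers,es)/", "20060616001149", "Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W"),
    ("com,yahoo,answers,fr)/", "20060625153331", "3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW"),
    ("com,yahoo,answers,it)/", "20060618041859", "45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX"),
]

attachments = load_attachments(cdxa_lines)
loose = list(languages(cdx_rows, attachments, threshold=10))
strict = list(languages(cdx_rows, attachments, threshold=60))
print(loose)
print(strict)
```

Because the threshold is applied when reading the attachment rather than when detecting languages, the weeks-long detection job runs once and different studies can pick their own cut-off.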

  13. We have more available (examples)
      • Dataset of all homepages in Global Wayback (GWB) – web.archive.org
        • Extracted from snapshot 20180911224740
        • GWB-20180911224740_homepages.cdx.gz
      • Pre-processed attachments
        • GWB-20180911224740_homepages-*.cdx.gz
        • GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz
        • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz
        • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz
        • GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz
        • GWB-20180911224740_homepages-*.cdx.last.cdxa.gz
          # The last available capture
          Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
          RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
          3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
          5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
          XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
          7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
          45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5
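How a "last" or "last-success" attachment might be derived from CDX rows can be sketched as below. The interpretation is an assumption: "last" is taken to mean the most recent capture per SURT key and "last-success" the most recent one with HTTP status 200; the actual definitions used for the published files are not given on the slide. The sample rows reuse keys, timestamps, and status codes that appear on the slides, but two rows per key are of course not the full capture history.

```python
def last_captures(rows, success_only=False):
    """Return {surt: (timestamp, status)} for the latest capture per SURT key.

    With success_only=True, only captures with status 200 are considered
    (an assumed reading of "last-success").
    """
    latest = {}
    for surt, timestamp, status in rows:
        if success_only and status != 200:
            continue
        # 14-digit timestamps compare chronologically as strings.
        if surt not in latest or timestamp > latest[surt][0]:
            latest[surt] = (timestamp, status)
    return latest


rows = [
    ("com,yahoo,answers,es)/", "20060616001149", 200),
    ("com,yahoo,answers,es)/", "20180904025943", 200),
    ("com,yahoo,answers,in)/", "20060422210325", 200),
    ("com,yahoo,answers,in)/", "20180906005337", 301),
]

last = last_captures(rows)
last_success = last_captures(rows, success_only=True)
print(last["com,yahoo,answers,in)/"])
print(last_success["com,yahoo,answers,in)/"])
```

Pre-computing such selections once and shipping them as attachments spares every downstream user from scanning the full capture index.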

  14. Fatcat.wiki (beta)
      Archive and knowledge graph of every publicly-accessible scholarly output with a priority on long-tail, at-risk publications.

  15. Fatcat.wiki (big catalog)
      • At-scale web harvesting of scholarly works
        • with descriptive metadata and full-text
        • linked with versions and secondary outputs
      • API-first accessible / editable system

  16. Challenge: the Internet Archive is big
      • Web archive / Wayback Machine
        • 20+ years of web
        • 625+ library and other partners
        • 753,932,022,000 (captured) URLs
        • 362 billion web pages
        • More than 5,000 URLs archived every second
        • 40+ petabytes
      • And there's more …

  17. Challenge: web archives are Big Data
      • Processing requires computing clusters
        • e.g., Hadoop, YARN, Spark, …
      • MapReduce or variants
        • Homogeneous data types / formats
        • Distributed batch processing
          • load → transform → aggregate → write
      • Web archive data is heterogeneous, may include text, video, images, …
        • Common header / metadata format, but various / diverse payloads
        • Requires cleaning, filtering, selection, extraction before processing
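The load → transform → aggregate → write pattern, including the filtering step that heterogeneous web archive data needs first, can be sketched on a single machine with plain Python generators. This is not how a Hadoop or Spark job is written; it only mirrors the stages. The records, field names, and hosts are made up for illustration.

```python
import io
from collections import Counter


def load(records):
    """Stand-in for reading capture metadata from (W)ARC / CDX input."""
    yield from records


def transform(records):
    """Filter heterogeneous records down to successful HTML captures, keep the host."""
    for record in records:
        if record["mime"] == "text/html" and record["status"] == 200:
            yield record["host"]


def aggregate(hosts):
    """Count captures per host."""
    return Counter(hosts)


def write(counts, out):
    """Write tab-separated host counts, sorted by host."""
    for host, count in sorted(counts.items()):
        out.write(f"{host}\t{count}\n")


# Made-up capture metadata with mixed payload types.
records = [
    {"host": "a.example", "mime": "text/html", "status": 200},
    {"host": "a.example", "mime": "image/png", "status": 200},
    {"host": "b.example", "mime": "text/html", "status": 200},
    {"host": "a.example", "mime": "text/html", "status": 200},
    {"host": "b.example", "mime": "text/html", "status": 404},
]

out = io.StringIO()
write(aggregate(transform(load(records))), out)
result = out.getvalue()
print(result, end="")
```

On a cluster, the same stages would be distributed, but the selection step stays essential: batch frameworks assume homogeneous input, so the mixed payloads have to be filtered before any aggregation.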
