Web Data Engineering: A Technical Perspective on Web Archives




SLIDE 1

Web Data Engineering:

A Technical Perspective on Web Archives

  • Dr. Helge Holzmann

Web Data Engineer, Internet Archive, helge@archive.org

Open Repositories 2019

Hamburg, Germany June 12, 2019

SLIDE 2

What is a web archive?

  • Web archives preserve our history as documented on the web…
  • … in huge datasets, consisting of all kinds of web resources
    • e.g., HTML pages, images, video, scripts, …
  • … stored as big files in the standardized (W)ARC format
    • along with metadata + request / response headers
    • next to lightweight capture index files (CDX)
  • … to provide access to webpages from the past
    • for users through close reading
      • replayed by the Wayback Machine
    • for data analysis at scale through distant reading
      • enabled by Big Data processing methods, like Hadoop / Spark, …
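The CDX index files mentioned above are plain, space-separated text, which makes them easy to inspect without any Big Data tooling. The sketch below parses one line in plain Python; the 11-field layout is the common CDX convention and is an assumption here, since real CDX files declare their layout in a header line and may differ.

```python
# Minimal sketch of parsing a CDX capture-index line in plain Python.
# The 11-field layout below is the common CDX convention; actual files
# declare their layout in a header line and may differ.
FIELDS = ["urlkey", "timestamp", "url", "mime", "status",
          "digest", "redirect", "meta", "length", "offset", "filename"]

def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict of named fields."""
    return dict(zip(FIELDS, line.split()))

line = ("com,example)/ 20190612000000 http://example.com/ text/html 200 "
        "Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W - - 1024 5678 example.warc.gz")
record = parse_cdx_line(line)
print(record["status"], record["filename"])  # 200 example.warc.gz
```

A record like this tells a reader where (`filename`, `offset`, `length`) the corresponding (W)ARC record lives, which is what makes the lightweight index useful before touching the big files.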

Helge Holzmann (helge@archive.org) 2019-06-12

SLIDE 3

SLIDE 4

SLIDE 5

Not today's topic …


http://blog.archive.org/2016/09/19/the-internet-archive-turns-20

SLIDE 6

The (archived) web…

  • ... is a very valuable dataset to study the web (and the offline world)
    • Access to very diverse knowledge from various disciplines (history, politics, …)
    • The whole web at your fingertips / processable snapshots
    • Adds a temporal dimension to the web / captures dynamics
  • ... is a widely unstructured collection of data
    • Access and analysis at scale is challenging
      • Processing petabytes of data is expensive and time-consuming
    • Difficult to discover, identify, and extract records and the information they contain
      • Potentially highly technical, complex access and parsing process
      • Low-level details users / researchers / data scientists don't want to / can't deal with
    • Data engineering is needed before the data can be used in downstream applications / studies


SLIDE 7

Different perspectives on web archives

  • User-centric view
    • (Temporal) search / information retrieval
    • Direct access / replaying archived pages
  • Data-centric view
    • (W)ARC and CDX (metadata) datasets
    • Big Data processing: Hadoop, Spark, …
    • Content analysis, historical / evolution studies
  • Graph-centric view
    • Structural view on the dataset
    • Graph algorithms / analysis, structured information
    • Hyperlink and host graphs, entity / social networks, facts, and more

[Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis, 2019]

SLIDE 8

Web (archives) as graph

  • Foundational model for most downstream applications / analysis tasks
    • E.g., search index construction, term / entity co-occurrence studies, …
  • Different ways / approaches to construct / extract (temporal) graphs
    • (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc.
  • Technical challenges that users don't want to / can't deal with:
    • Efficient generation, effective representation, …


SLIDE 9

(Temporal) search in web archives

  • Wanted: Enter a textual query, find relevant captures
  • Challenges:
    • Documents are temporal / consist of multiple versions
      • New captures could be near-duplicates or contain relevant changes
      • Temporal relevance in addition to textual relevance
    • Relevance to the query is not always encoded in the content
    • Information needs / query intents are different from traditional IR
      • Mostly navigational: Under which URL can I find a specific resource?
  • How to turn (temporal) graphs into a searchable index?
    • Integrate full-text, titles, headlines, anchor texts, ...?
    • Convert into a format supported by information retrieval systems, e.g., Elasticsearch
    • Adaptation of existing retrieval models
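As a hedged sketch of that last step, converting extracted captures into a searchable index: the snippet below emits newline-delimited bulk actions in the shape Elasticsearch's `_bulk` API accepts. The index name, document fields, and the one-document-per-capture ID scheme are illustrative assumptions, not the deck's actual pipeline.

```python
import json

def to_bulk(captures, index="webarchive"):
    """Turn capture dicts into newline-delimited Elasticsearch bulk actions.
    One document per capture keeps every temporal version searchable."""
    lines = []
    for c in captures:
        doc_id = f"{c['url']}@{c['timestamp']}"  # URL + timestamp identifies one version
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"url": c["url"], "timestamp": c["timestamp"],
                                 "title": c["title"], "text": c["text"]}))
    return "\n".join(lines) + "\n"

captures = [{"url": "http://example.com/", "timestamp": "20190612000000",
             "title": "Example Domain", "text": "Hello from the past"}]
bulk = to_bulk(captures)
```

Keeping the timestamp both in the ID and as a document field is what lets temporal relevance be scored alongside textual relevance.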


SLIDE 10

Web Data Engineering

  • Transforming data into useful information
  • Making it usable for downstream applications
    • Search, data science, digital humanities, content analysis, ...
    • Regular users, researchers, data scientists / analysts, ...
  • Enabling efficient and effective access through...
    • ... infrastructures
    • ... suitable data formats
    • ... simple tools / APIs
    • ... optimized indexes
  • Technical considerations made by computer scientists
    • to help users / researchers focus on their application / study / research
    • to hide complexity / low-level details through flexible abstractions


SLIDE 11

Example: Language Analysis (1)

  • Possible research questions:
    • Which pages of a language exist outside the country's ccTLD?
    • Which languages are used the most in a certain area / topic?
    • How has a language evolved over time on the web?
  • Requirements:
    • Tools for (W)ARC access, HTML parsing, language detection
    • Language-annotated pages / captures
  • Challenges:
    • Texts too short to detect a language / confidence scores
    • Multiple languages on one page / filtering and weighting
    • Slow and expensive processing due to large-scale content analysis (weeks)


SLIDE 12

Example: Language Analysis (2)

  • Wanted:
    • Efficient access to comprehensive results
    • Lightweight, reusable exchange format
    • Dynamic threshold / flexible post-filtering
  • Solution: (CDX) Attachment Format (ATT / CDXA)
    • Lightweight, efficient loading, integrated data validation, decoupled from data


Example attachment file (*.cdx.lang_2017-18_v2.cdxa.gz):

# Language detection using 'square leaf' approach
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W es:82
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ es:97
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW fr:54,en:7
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC id:94,en:2
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y en:97
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX it:80,en:12

Corresponding CDX (capture index) with pointers to the (W)ARC records (*.cdx):

com,yahoo,answers,es)/ 20060616001149 http://es.an … 200 Y2P2LXHTCPGLNZOFAZ
com,yahoo,answers,espanol)/ 20060617034947 http:// … text/html 200 RMMUE3QW
com,yahoo,answers,fr)/ 20060625153331 http://fr.an … 200 3OLFJYPP5Y3V75OPD5
com,yahoo,answers,hk)/ 20150819101628 https://hk.a … 0 5CUBOU4KW75IILS5D6H6
com,yahoo,answers,id)/ 20070629224925 http://id.an … 200 XEXA32HHEAHWLVN52J
com,yahoo,answers,in)/ 20060422210325 http://in.an … 200 7LZJPKLXDVE5DG2RIO
com,yahoo,answers,it)/ 20060618041859 http://it.an … 200 45PAAZHDBCJY65YSBX
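Joining such an attachment back to its CDX file is a simple digest lookup. The plain-Python sketch below mirrors the example format (a digest followed by `lang:confidence` pairs) and applies a caller-chosen confidence threshold as post-filtering; the exact CDX field positions are assumptions for illustration.

```python
def parse_attachment(line):
    """Parse one CDXA line: a digest followed by lang:confidence pairs."""
    digest, langs = line.split(maxsplit=1)
    return digest, {lang: int(conf)
                    for lang, conf in (p.split(":") for p in langs.split(","))}

def join_and_filter(cdx_lines, att_lines, min_conf=50):
    """Join CDX lines with language attachments by digest, then post-filter
    by a caller-chosen confidence threshold (the 'dynamic threshold')."""
    langs_by_digest = dict(parse_attachment(l) for l in att_lines)
    for line in cdx_lines:
        digest = line.split()[-1]  # digest assumed to be the last field here
        kept = {lang: conf
                for lang, conf in langs_by_digest.get(digest, {}).items()
                if conf >= min_conf}
        if kept:
            yield line, kept

att = ["Y2P2LXHT es:82", "3OLFJYPP fr:54,en:7"]
cdx = ["com,example,es)/ 20060616001149 http://es.example.com/ 200 Y2P2LXHT",
       "com,example,fr)/ 20060625153331 http://fr.example.com/ 200 3OLFJYPP"]
results = list(join_and_filter(cdx, att, min_conf=50))  # en:7 is filtered out
```

Because the threshold is applied at read time rather than baked into the attachment, the same CDXA file serves studies with different precision requirements.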

SLIDE 13

We have more available (examples)

  • Dataset of all homepages in the Global Wayback (GWB) – web.archive.org
    • Extracted from snapshot 20180911224740
    • GWB-20180911224740_homepages.cdx.gz
  • Pre-processed attachments
    • GWB-20180911224740_homepages-*.cdx.gz
    • GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz
    • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz
    • GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz
    • GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz
    • GWB-20180911224740_homepages-*.cdx.last.cdxa.gz


Example (the last available capture per homepage):

# The last available capture
Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5

SLIDE 14

Fatcat.wiki (beta)

Archive and knowledge graph...

  • ... of every publicly-accessible scholarly output
  • ... with a priority on long-tail, at-risk publications

SLIDE 15

Fatcat.wiki (big catalog)

  • At-scale web harvesting of scholarly works
    • with descriptive metadata and full-text
    • linked with versions and secondary outputs
  • API-first accessible / editable system

SLIDE 16

Challenge: the Internet Archive is big

  • Web archive / Wayback Machine
    • 20+ years of the web
    • 625+ library and other partners
    • 753,932,022,000 (captured) URLs
    • 362 billion web pages
    • More than 5,000 URLs archived every second
    • 40+ petabytes
  • And there's more:


SLIDE 17

Challenge: web archives are Big Data

  • Processing requires computing clusters
    • i.e., Hadoop, YARN, Spark, …
  • Web archive data is heterogeneous and may include text, video, images, …
    • Common header / metadata format, but various / diverse payloads
    • Requires cleaning, filtering, selection, and extraction before processing
  • MapReduce or variants
    • Homogeneous data types / formats
    • Distributed batch processing
    • load → transform → aggregate → write
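The load → transform → aggregate → write pattern can be sketched locally in plain Python; a real job would run the same stages distributed on Hadoop or Spark over (W)ARC inputs. The record fields here are illustrative stand-ins.

```python
from collections import Counter

def load(records):
    """Load stage: iterate raw records (stand-in for reading (W)ARC files)."""
    return iter(records)

def transform(records):
    """Transform stage: clean/filter, keeping only successful HTML captures."""
    for r in records:
        if r["status"] == 200 and r["mime"] == "text/html":
            yield r["domain"]

def aggregate(domains):
    """Aggregate stage: reduce to per-domain capture counts."""
    return Counter(domains)

records = [
    {"status": 200, "mime": "text/html", "domain": "example.com"},
    {"status": 404, "mime": "text/html", "domain": "example.com"},  # filtered out
    {"status": 200, "mime": "text/html", "domain": "archive.org"},
]
counts = aggregate(transform(load(records)))  # write stage omitted for brevity
```

Each stage consumes an iterator and yields another, which is exactly the shape that distributes well: the transform runs per record on any worker, and only the aggregate needs a shuffle.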


SLIDE 18

Trade-off: data locality vs. random access

  • Direct access allows for exploiting data locality
    • Moving computation to the data / sequential scans
  • Indirect access with selective random accesses
    • Scanning sequentially results in wasted reads (PB)
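Selective random access is what the (offset, length) pointers in the CDX index enable: seek straight to one record instead of scanning petabytes. The sketch below demonstrates the mechanism on a tiny stand-in file of concatenated gzip members, which mirrors how (W)ARC files store each record as its own gzip member.

```python
import gzip
import os
import tempfile

def read_record(warc_path, offset, length):
    """Random access: seek to one gzipped record and decompress only it."""
    with open(warc_path, "rb") as f:
        f.seek(offset)               # jump directly to the record
        compressed = f.read(length)  # read only this record's bytes
    return gzip.decompress(compressed)

# Demo: a tiny stand-in file made of two independent gzip members.
members = [gzip.compress(b"record-one"), gzip.compress(b"record-two")]
path = os.path.join(tempfile.gettempdir(), "demo.warc.gz")
with open(path, "wb") as f:
    f.write(b"".join(members))

# Offset and length of the second record, as a CDX index would record them.
payload = read_record(path, len(members[0]), len(members[1]))
```

Per-record compression is the design choice that makes this possible: decompression can start at any record boundary, so a reader never has to touch the bytes before it.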


SLIDE 19

Efficient processing

  • Indirect access via lightweight metadata (CDX)
    • Basic operations on metadata before touching the archive (filter, group, sort)
    • E.g., offline pages, data types (scripts, styles, images, ...), domains
  • Enriching records with data from the payload for downstream applications
    • E.g., titles, headlines, links, part-of-speech tags, named entities, ...


SLIDE 20

Sparkling data processing ☆

  • (Internal) data processing library based on Apache Spark
    • Goal: integrate all APIs to work with (temporal) web data in one library
    • Continuous work in progress, growing with every new task
  • Rich in features
    • Efficient CDX / (W)ARC loading, parsing, and storing from HDFS, Petabox, …
    • Fast HTML processing without expensive DOM parsing (SAX-like)
    • Internal PetaBox authentication / access features
    • ATT / CDXA attachment loaders and writers
    • Shell / Python integration for computing derivations
    • Distributed budget-aware repartitioning (e.g., 1 GB per partition / file)
    • Advanced retry / timeout / failure handling
    • Lots of utilities for logging, file handling, string operations, URL/SURT formatting, …
    • Easily configurable, library-wide constants and settings


SLIDE 21

ArchiveSpark

  • Expressive and efficient data access and processing
    • Declarative workflows, seamless two-step loading approach
  • Open source
    • Available on GitHub: https://github.com/helgeho/ArchiveSpark
    • with documentation, a Docker image, and recipes for common tasks
  • Modular / extensible
    • Various DataSpecifications and EnrichFunctions
    • ArchiveSpark-server: web service API for ArchiveSpark
      • https://github.com/helgeho/ArchiveSpark-server
    • Generalizable to archival collections beyond web archives

[Helge Holzmann, Vinay Goel and Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016]
[Helge Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017]

SLIDE 22

Simple and expressive interface

  • Based on Spark, powered by Scala
    • This does not mean you have to learn a new programming language!
    • The interface is rather declarative / no deep Scala or Spark knowledge required
  • Simple data accessors are included
    • Provide simplified access to the underlying data model
    • Easy extraction / enrichment mechanisms
    • Customizable and extensible by advanced users

val rdd = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
val entities = onlineHtml.enrich(Entities)
entities.saveAsJson("entities.gz")

SLIDE 23

Familiar, readable, reusable output

  • Nested JSON output encodes lineage of applied enrichments

[Figure: example JSON output with nested fields title, text, entities, persons]
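A hypothetical example of what such a nested record could look like: each enrichment nests under the field it was derived from, so the path from payload to text to entities encodes the lineage. Field names and nesting here are illustrative assumptions, not ArchiveSpark's exact schema.

```python
import json

# Illustrative nested output record: 'persons' was enriched from 'entities',
# which was enriched from 'text', which came from the HTML payload.
record = {
    "record": {"surtUrl": "com,example)/", "timestamp": "20190612000000",
               "mime": "text/html", "status": 200},
    "payload": {
        "title": "Example Domain",
        "text": {
            "entities": {
                "persons": ["Helge Holzmann"]
            }
        }
    }
}
line = json.dumps(record)  # one JSON record per line in the output file
```

Because the lineage is in the structure itself, a downstream reader can tell which enrichment produced each value without consulting the job that wrote it.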

SLIDE 24

Benchmarks vs. Spark / HBase

  • Three scenarios, from basic to more sophisticated:

a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month

  • Benchmarks do not include derivations
  • Those are applied on top of all three methods and involve third-party libraries
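Scenario (c) can be expressed over CDX metadata alone, which is exactly why the metadata-first approach is fast. A plain-Python stand-in, assuming (url, timestamp, status) tuples:

```python
def latest_success_per_url(captures, month):
    """For each URL, keep the latest capture with HTTP status 200 whose
    14-digit timestamp falls in the given YYYYMM month."""
    best = {}
    for url, ts, status in captures:
        if status == 200 and ts.startswith(month):
            if url not in best or ts > best[url]:
                best[url] = ts
    return best

captures = [
    ("http://example.com/", "20180601120000", 200),
    ("http://example.com/", "20180620120000", 200),  # latest success in month
    ("http://example.com/", "20180625120000", 404),  # unsuccessful, skipped
    ("http://archive.org/", "20180512120000", 200),  # wrong month, skipped
]
result = latest_success_per_url(captures, month="201806")
```

Timestamps sort lexicographically because they are fixed-width digit strings, so plain string comparison suffices for "latest".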


SLIDE 25

New ArchiveSpark (3.0) very soon

  • Major overhaul
    • Streamlined dependencies and package structure
    • Even more simplified API
    • Lots of bug fixes and improvements
  • Will be widely based on / include parts of Sparkling
    • org.archive.archivespark.sparkling
    • Will benefit from Sparkling fixes and updates
  • Almost ready
    • Please have a little patience and check back soon…
    • Follow / star / watch on GitHub
      • https://github.com/helgeho/ArchiveSpark


SLIDE 26

We're at your service!

  • Archive-It Research Services (ARS)
    • WAT (extended metadata files)
    • LGA (temporal graphs)
    • WANE (named entities)
  • Special Seed Services (Artificial Zone Files)
    • Language + GeoIP analysis
  • Nation Wide Web (NWW) Search
    • Customized / regional web + media search
  • APIs
    • WASAPI data-transfer API (Archive-It)
    • Availability API + CDX Server (Wayback)
    • More to come soon, stay tuned…


SLIDE 27

Thank you!

Helge Holzmann (helge@archive.org)

  • archive.org
  • archive-it.org
  • fatcat.wiki
  • github.com/helgeho/ArchiveSpark

Questions?

2019-06-12 www.HelgeHolzmann.de


If interested in our work, please get in touch!