Big (Linked) Semantic Data Compression
Motivation & Challenges
Antonio Fariña, Javier D. Fernández and
Miguel A. Martinez-Prieto
23RD AUGUST 2017
3rd KEYSTONE Training School Keyword search in Big Linked Data
Image: ROMAN AQUEDUCT (SEGOVIA, SPAIN)
Agenda
images: zurb.com
Linked Data & Semantic Technologies
Big (Linked) Semantic Data Compression
Linked Data is simply about using the Web to create typed links between data from different sources.
Linked Data Foundations
BIG (LINKED) SEMANTIC DATA COMPRESSION PAGE 4
data on the Web.
providers, leading to the creation of a global data space:
Linked Data
Linked Data Principles
up those names.
useful information, using standards (e.g. RDF , SPARQL).
they can discover more things.
#1 URIs as names
What is his name?
“Homer Simpson” “Homer Simpson”
name data entities:
Identifier) enables any real-world entity to be identified at universal scale:
What is his name?
http://example.org/person/homer-simpson http://example.org/person/homer-simpson-guy
Names must ensure that any data entity has its
#2 HTTP URIs
be retrieved when an HTTP URI is accessed (via HTTP client).
http://example.org/person/homer-simpson
http://example.org/property/name "Homer Simpson"
http://example.org/property/address "742 Evergreen Terrace"
http://example.org/property/location http://example.org/place/springfield
http://example.org/property/father http://example.org/person/abe-simpson
...
Entity names must be searchable (via HTTP).
#3 Standards
within the Linked Data ecosystem…
(economy, bioinformatics, multimedia…).
languages” for effective understanding.
Turtle, HDT…) for data storage.
existing entities:
#4 Links to Other URIs
Figure: RDF graph in which person:homer-simpson (property:name "Homer Simpson") and person:marge-simpson (property:name "Marge Simpson") both link to "742 Evergreen Terrace" through property:address.
@prefix person: <http://example.org/person/> .
@prefix property: <http://example.org/property/> .
Figure: the same graph extended with location:Springfield (property:name "Springfield"), person:abe-simpson (property:name "Abe Simpson", property:age 83) and person:bart-simpson (property:name "Bart Simpson", property:age 10), connected through property:location, property:father and property:mother edges.
The Web of Linked Data revisits WWW foundations to build a cloud of data-to-data labelled hyperlinks.
Web of Linked Data
The Web of Linked Data
Knowledge from different fields can be easily integrated and universally shared/exploited using WWW infrastructure.
The Web of Linked Data (2007 – 2011)
http://lod-cloud.net/
The Web of Linked Data (2014)
http://lod-cloud.net/
The Web of Linked Data (2017)
domains which include many and varied knowledge fields.
entity descriptions and (inter/intra-dataset) links between them.
data.
http://lod-cloud.net/ http://stats.lod2.eu/ http://sparqles.ai.wu.ac.at/
RDF is a standard model for data interchange
data merging even if the underlying schemas differ…
RDF
consumption on the Web of Linked Data.
structure:
RDF Basics
http://example.org/person/homer-simpson http://example.org/property/name "Homer Simpson"
http://example.org/person/homer-simpson http://example.org/property/father http://example.org/person/abe-simpson
subject and object nodes are linked by a particular (predicate) edge:
described by any vocabulary/ontology.
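The triple structure above can be sketched in Python. This is only an illustration, assuming triples are held as plain (subject, predicate, object) tuples; the `EX` constant and the `outgoing_edges` helper are hypothetical names, not part of any RDF library.

```python
from collections import defaultdict

EX = "http://example.org/"  # illustrative base URI, taken from the slide examples

# Each triple is a (subject, predicate, object) tuple.
triples = [
    (EX + "person/homer-simpson", EX + "property/name", '"Homer Simpson"'),
    (EX + "person/homer-simpson", EX + "property/father", EX + "person/abe-simpson"),
]

def outgoing_edges(triples):
    """Index the graph by subject: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)

graph = outgoing_edges(triples)
for predicate, obj in graph[EX + "person/homer-simpson"]:
    print(predicate, "->", obj)
```

Each tuple is one labelled edge, so the list of tuples and the subject-indexed dictionary describe the same graph.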
RDF Triples
@prefix person: <http://example.org/person/> .
@prefix property: <http://example.org/property/> .
person:homer-simpson property:name "Homer Simpson"
person:homer-simpson property:address "742 Evergreen Terrace"
person:homer-simpson property:father person:abe-simpson
RDF Graphs
@prefix person: <http://example.org/person/> .
@prefix property: <http://example.org/property/> .
Figure: the full example graph, with nodes person:homer-simpson, person:marge-simpson, person:abe-simpson, person:bart-simpson and location:springfield linked by property:name, property:address, property:location, property:father, property:mother and property:age edges.
effective storage:
most relevant tasks in the Web of Linked Data.
RDF Serialization Formats
N-Triples, RDF/XML, N3, JSON-LD
http://www.easyrdf.org/converter
SPARQL is a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.
SPARQL
manipulate RDF data.
query in a large data graph.
predicate and object may be a variable.
triple patterns.
SPARQL Basics
captured by the property
http://example.org/property/location.
is named by the URI
http://example.org/location/springfield.
SPARQL Querying
Who lives in Springfield?
?who property:location location:springfield
PREFIX location: <http://example.org/location/>
PREFIX property: <http://example.org/property/>
SELECT ?who WHERE { ?who property:location location:springfield }
Figure: the triple pattern matched against the example graph; each property:location edge that points to location:springfield yields one binding for ?who.
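Triple pattern resolution can be sketched as a scan over the triples. This is a minimal Python illustration, not how a SPARQL engine is actually implemented; the `match` helper is a hypothetical name, and the two property:location triples are assumed sample data, since the slides do not list which people are located in Springfield.

```python
def match(pattern, triples):
    """Return one {variable: value} binding per triple that matches the pattern.

    Pattern terms starting with '?' are variables; other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value everywhere.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results

# Assumed sample data (prefixed names kept as plain strings for brevity).
triples = [
    ("person:homer-simpson", "property:location", "location:springfield"),
    ("person:marge-simpson", "property:location", "location:springfield"),
    ("person:homer-simpson", "property:name", '"Homer Simpson"'),
]

answers = match(("?who", "property:location", "location:springfield"), triples)
print([b["?who"] for b in answers])
```

A full scan is the simplest possible strategy; real stores answer the same pattern through indexes.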
(Some) Open Issues
Big (Linked) Semantic Data Compression
Data moves from data providers to end users within the Linked Data ecosystem. It evolves along many stages to consolidate effective results which satisfy end-user requirements.
Linked Data Workflow
RDF dataset.
services.
Linked Data Workflow
Figure: Linked Data workflow. GENERATION: Real-world Facts → Data Creator → RDF Dataset. PUBLICATION: Data Provider (HTTP URIs, Dump, Endpoint, LD Fragment). CONSUMPTION: Data Consumers.
quality data which is finally modeled using RDF:
transformation.
relevant entities in the Linked Data Web:
RDF Data Generation
The new RDF dataset must be exposed in the Linked Data Web.
I. Triples are serialized (and possibly compressed) into a valid RDF format.
II. The resulting dump file is hosted on a web server and registered in a central catalog (e.g. datahub.io) for discovery purposes.
I. Triples are stored and indexed (possibly using a semantic database).
II. An interface is exposed for dereferencing URIs.
I. Triples are stored and indexed using a semantic database.
II. A SPARQL interface is exposed for querying.
I. Triples are self-indexed using an in-memory RDF engine (HDT).
II. An LD interface is exposed for (federated) querying.
RDF Data Publication
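The dump-based publication route (serialize, then possibly compress) can be sketched in Python. This is only an illustration under stated assumptions: terms are assumed to be already written in N-Triples syntax, gzip stands in for whichever general-purpose compressor a publisher chooses, and `to_ntriples` is a hypothetical helper.

```python
import gzip

def to_ntriples(triples):
    """Serialize (s, p, o) tuples as N-Triples lines.

    Terms are assumed to be already in N-Triples syntax (<uri> or "literal").
    """
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)

triples = [
    ("<http://example.org/person/homer-simpson>",
     "<http://example.org/property/name>",
     '"Homer Simpson"'),
    ("<http://example.org/person/homer-simpson>",
     "<http://example.org/property/father>",
     "<http://example.org/person/abe-simpson>"),
]

dump = to_ntriples(triples).encode("utf-8")
compressed = gzip.compress(dump)

# Lossless: decompressing returns exactly the serialized dump.
assert gzip.decompress(compressed) == dump
print(len(dump), "bytes as N-Triples;", len(compressed), "bytes after gzip")
```

On a sample this small the gzip header can outweigh the savings; the benefit appears on large dumps, where URIs repeat heavily.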
RDF Data Consumption
Consumers retrieve RDF data to meet their information requirements.
matching particular needs:
retrieved from the corresponding publisher.
SPARQL queries in a particular dataset.
along the Web of Linked Data.
This processing workflow suffers when Big Linked Data must be managed along it.
Big is not a matter of size... it is a matter of representativeness and consumption capacity.
Big Linked Data Challenges
We refer to an RDF dataset as Big Linked Data (BLD) when it exceeds the capacity of the conventional tools used to implement the processing workflow.
validity, vulnerability…) .
Big Linked Data
RDF sources (e.g. meteorology or traffic sensors, social networks…).
amounts of RDF data along the cleansing/integration stages.
corresponding triples (HTTP URIs).
query resolution (SPARQL endpoints).
Big Linked Data Challenges
Semantic Data Compression
Big (Linked) Semantic Data Compression
serializing RDF triples:
required by traditional formats.
address some of the Big Linked Data challenges:
(possibly) requires less memory for triple parsing.
with no prior triple decoding.
Semantic Data Compression
Data redundancy means that the same information can be encoded using fewer bits.
Why is Semantic Data Redundant?
be encoded using fewer bits:
is removed from the original dataset.
semantic redundancy.
dataset → symbolic redundancy.
itself → syntactic (structural) redundancy.
Semantic Data Redundancy
using fewer triples:
Semantic Redundancy
http://example.org/property/age http://www.w3.org/2000/01/rdf-schema#domain http://example.org/classes/person
http://example.org/person/bart-simpson http://example.org/property/age 10
http://example.org/person/bart-simpson http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://example.org/classes/person
person/bart-simpson is of type http://example.org/classes/person because of the
second triple (it provides the age of the person).
compressors.
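The example above can be sketched in Python: the rdf:type triple is dropped because the rdfs:domain declaration lets it be re-derived. This is a minimal illustration covering only the domain rule, not a full reasoner; `drop_inferable_types` is a hypothetical helper name.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"
EX = "http://example.org/"

def drop_inferable_types(triples):
    """Remove rdf:type triples that the rdfs:domain rule can re-derive."""
    # (property, class) pairs declared through rdfs:domain.
    domains = {(s, o) for s, p, o in triples if p == RDFS_DOMAIN}
    # (resource, class) pairs implied by "resource uses a property whose domain is class".
    inferable = {(s, cls) for s, p, _ in triples for prop, cls in domains if p == prop}
    return [t for t in triples
            if not (t[1] == RDF_TYPE and (t[0], t[2]) in inferable)]

triples = [
    (EX + "property/age", RDFS_DOMAIN, EX + "classes/person"),
    (EX + "person/bart-simpson", EX + "property/age", "10"),
    (EX + "person/bart-simpson", RDF_TYPE, EX + "classes/person"),
]

reduced = drop_inferable_types(triples)
print(len(triples), "triples ->", len(reduced), "triples")
```

Recovering the removed triple requires running the same inference at decompression time, so the schema triples must be kept.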
Symbolic Redundancy
http://example.org/class/person
http://example.org/property/address
http://example.org/property/age
http://example.org/property/location
http://example.org/person/abe-simpson
http://example.org/person/bart-simpson
http://example.org/person/homer-simpson
http://example.org/person/marge-simpson
Abe Simpson
Marge Simpson
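The terms listed above share long prefixes, which is the kind of repetition symbolic redundancy refers to. A hedged Python sketch of one way to exploit it is front coding over a sorted term list; the slides do not name a specific technique here, so front coding is an assumption, and `front_code` / `front_decode` are hypothetical helpers.

```python
import os

def front_code(sorted_terms):
    """Encode each term as (length of prefix shared with the previous term, suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([previous, term]))
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded

def front_decode(encoded):
    """Rebuild the original terms from (shared length, suffix) pairs."""
    terms, previous = [], ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        terms.append(previous)
    return terms

terms = sorted([
    "http://example.org/person/abe-simpson",
    "http://example.org/person/bart-simpson",
    "http://example.org/person/homer-simpson",
    "http://example.org/person/marge-simpson",
])

encoded = front_code(terms)
for shared, suffix in encoded:
    print(shared, suffix)
```

Only the first URI is stored in full; each later entry stores the shared-prefix length plus the differing suffix.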
serialized:
resource) writes n times the subject value. It can be abbreviated.
the same sub-graph structure.
Syntactic Redundancy
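The subject repetition mentioned above can be sketched in Python by grouping a flat triple list so that each subject is written once. This is only an illustration of the idea; `group_by_subject` is a hypothetical helper, and prefixed names are kept as plain strings for brevity.

```python
from collections import defaultdict

def group_by_subject(triples):
    """Write each subject once, followed by its predicates and their objects."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        grouped[s][p].append(o)
    return {s: dict(po) for s, po in grouped.items()}

# A flat serialization repeats the subject in every triple.
triples = [
    ("person:homer-simpson", "property:name", '"Homer Simpson"'),
    ("person:homer-simpson", "property:address", '"742 Evergreen Terrace"'),
    ("person:homer-simpson", "property:father", "person:abe-simpson"),
]

grouped = group_by_subject(triples)
flat_mentions = sum(1 for _ in triples)   # subject written once per triple
grouped_mentions = len(grouped)           # subject written once overall
print(flat_mentions, "subject mentions ->", grouped_mentions)
```

Turtle's `;` abbreviation applies the same grouping at the syntax level.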
The current state of the art comprises a rich and varied set of compressors for RDF data. These are mainly lossless compressors (because they preserve the original knowledge in the dataset), yet lossy compressors are also emerging.
Compression Approaches
lossless compression:
knowledge is not acceptable for the Linked Data workflow.
redundancy from the original dataset.
Compression Approaches
to the particular case of RDF compression:
triples to be rewritten as 3-ID tuples (ID graph).
removes different kinds of syntactic redundancy.
space than their uncompressed counterparts.
Physical Compressors
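Rewriting triples as 3-ID tuples can be sketched in Python. This is a minimal illustration, assuming a single dictionary shared by subjects, predicates and objects, which is a simplification; `encode` and `decode` are hypothetical helpers, not the API of any existing compressor.

```python
def encode(triples):
    """Map every distinct term to an integer ID and rewrite triples as ID tuples."""
    term_to_id = {}
    id_triples = []
    for triple in triples:
        # New terms receive the next free ID (1-based); known terms reuse theirs.
        ids = tuple(term_to_id.setdefault(term, len(term_to_id) + 1) for term in triple)
        id_triples.append(ids)
    return term_to_id, id_triples

def decode(term_to_id, id_triples):
    """Invert the dictionary to recover the original triples (lossless)."""
    id_to_term = {i: t for t, i in term_to_id.items()}
    return [tuple(id_to_term[i] for i in ids) for ids in id_triples]

triples = [
    ("person:homer-simpson", "property:name", '"Homer Simpson"'),
    ("person:homer-simpson", "property:father", "person:abe-simpson"),
    ("person:abe-simpson", "property:name", '"Abe Simpson"'),
]

dictionary, id_graph = encode(triples)
print(id_graph)
```

Each long string is stored once in the dictionary, and the ID graph that remains is a list of small integers that a graph compressor can then process.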
RDFCSA) are really self-indexes:
decompression.
inferred), and remove them from the dataset.
triples”.
subgraphs.
compressors.
Logical Compressors
redundancy at logical level.
(dictionary + graph compression).
most promising:
space requirements.
better overall search performance.
space/time tradeoffs.
Hybrid Compressors
RDF compression is a mature field of research, but the current state of the art has much room for optimization.
Achievements & Challenges
datasets (e.g. LOD Laundromat).
been adopted by many other tools in the Semantic Web community.
efficiently performed.
compression and SPARQL triple pattern resolution.
single dataset: LOD-a-lot.
RDF Compression Achievements
Sandra, Mario, Nieves, Ana, Antonio, Javier D., José M., Claudio, Antonio, Miguel A., Gonzalo, Axel
improving scalability.
performance.
currently working on it.
RDF Compression Challenges
Basics of Data Compression
Let the lecture continue…
Image: MAIN SQUARE & CATHEDRAL (SEGOVIA, SPAIN)