Semantic Sitemaps: R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, G. Tummarello (PowerPoint presentation)



SLIDE 1

Semantic Sitemaps

  • R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, G. Tummarello

DERI Galway

SLIDE 2

A new Web (of Data)

  • Old: documents for Web browsers
  • New: structured data for mashups and application integration

  • Key technology: RDF
SLIDE 3

Observation: Costs are shifting

  • Access to RDF data was hard (DMOZ, MusicBrainz)
  • Today: SPARQL protocol, Linked Data, Tabulator, …

  • Today: More data (FOAF, Linking Open Data)
  • Problem is no longer access but discovery
SLIDE 4

Towards a map of the new Web

  • Swoogle
  • SWSE
  • Falcon-S
  • Watson
  • Sindice
SLIDE 5

3 challenges

SLIDE 6
  • 1. Different access methods
  • Linked Data
  • RDF dumps
  • SPARQL endpoints

SLIDE 7
  • GET to http://dbpedia.org/resource/Tenerife
  • Dumps from http://downloads.dbpedia.org/
  • SPARQL to http://dbpedia.org/sparql
  • Same data everywhere
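
The three access methods listed above can be sketched as plain HTTP requests. This is only an illustration: the `Accept` header value and the `DESCRIBE` query are assumptions chosen for the example, not something the slides prescribe, and the requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URLs taken from the slide.
RESOURCE = "http://dbpedia.org/resource/Tenerife"
DUMP_SITE = "http://downloads.dbpedia.org/"
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"


def linked_data_request(uri: str) -> Request:
    """Linked Data: dereference the resource URI, asking for RDF.

    Assumption: the server honours content negotiation for RDF/XML.
    """
    return Request(uri, headers={"Accept": "application/rdf+xml"})


def dump_request(url: str) -> Request:
    """RDF dump: a plain download from the dump location."""
    return Request(url)


def sparql_request(endpoint: str, resource: str) -> Request:
    """SPARQL: send a query in the URL (DESCRIBE is an illustrative choice)."""
    query = f"DESCRIBE <{resource}>"
    return Request(endpoint + "?" + urlencode({"query": query}),
                   headers={"Accept": "application/rdf+xml"})


# Build (but do not send) one request per access method.
for name, req in [
    ("linked data", linked_data_request(RESOURCE)),
    ("dump", dump_request(DUMP_SITE)),
    ("sparql", sparql_request(SPARQL_ENDPOINT, RESOURCE)),
]:
    print(name, req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would perform the actual retrieval.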
SLIDE 8
  • 2. Crawl performance
  • Toy servers, aggressive crawlers
  • 1 request per second = 2.6M per month
  • Geonames has 6M+ entities
  • If a dump is available, how would a crawler know?
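
The crawl-rate arithmetic above can be checked with a few lines. The 30-day month and the round figure of 6,000,000 entities are assumptions for the sake of the calculation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_month(rate_per_second: float, days: int = 30) -> float:
    """Number of requests a polite crawler can issue in one month."""
    return rate_per_second * SECONDS_PER_DAY * days


def months_to_crawl(entities: int, rate_per_second: float = 1.0) -> float:
    """Months needed to fetch every entity once at the given rate."""
    return entities / requests_per_month(rate_per_second)


print(requests_per_month(1))                 # 2592000, i.e. roughly 2.6M
print(round(months_to_crawl(6_000_000), 1))  # 2.3
```

At one request per second, a dataset the size of Geonames would therefore take more than two months to crawl entity by entity.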

SLIDE 9
  • 3. Provenance
  • Is a built-in feature of the Web (DNS)
  • URI ownership, authoritative information
  • Delegation of URI space not visible
SLIDE 10

Proposed solution

SLIDE 11

Semantic Sitemaps

  • Publishers tell us where they have RDF data
  • Based on Google’s Sitemap protocol
  • Put a simple XML file on your server
SLIDE 12

Google’s Sitemap protocol

http://example.com/sitemap.xml

<urlset>
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-01</lastmod>
    <changefreq>monthly</changefreq>
  </url>
  ... more ...
</urlset>
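
As an illustration of how a consumer might read such a file, here is a minimal sketch using Python's standard library. The slide omits XML namespaces; the sample below assumes the standard sitemaps.org namespace, and `parse_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemap protocol (not shown on the slide).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-01</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
""".strip()


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing children map to None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
        })
    return entries


print(parse_sitemap(SAMPLE))
```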

SLIDE 13

Semantic Sitemaps

<urlset>
  ...
  <sc:dataset>
  </sc:dataset>
</urlset>

SLIDE 14

Semantic Sitemaps

<urlset>
  ...
  <sc:dataset>
    <sc:linkedDataPrefix>http://dbpedia.org/resource/</sc:linkedDataPrefix>
  </sc:dataset>
</urlset>

SLIDE 15

Semantic Sitemaps

<urlset>
  ...
  <sc:dataset>
    <sc:linkedDataPrefix>http://dbpedia.org/resource/</sc:linkedDataPrefix>
    <sc:dataDumpLocation>http://downloads.dbpedia.org/dump.nt.gz</sc:dataDumpLocation>
  </sc:dataset>
</urlset>

SLIDE 16

Semantic Sitemaps

<urlset>
  ...
  <sc:dataset>
    <sc:linkedDataPrefix>http://dbpedia.org/resource/</sc:linkedDataPrefix>
    <sc:dataDumpLocation>http://downloads.dbpedia.org/dump.nt.gz</sc:dataDumpLocation>
    <sc:sparqlEndpointLocation>http://dbpedia.org/sparql</sc:sparqlEndpointLocation>
  </sc:dataset>
</urlset>

SLIDE 17

Semantic Sitemaps

<urlset>
  ...
  <sc:dataset>
    <sc:linkedDataPrefix>http://dbpedia.org/resource/</sc:linkedDataPrefix>
    <sc:dataDumpLocation>http://downloads.dbpedia.org/dump.nt.gz</sc:dataDumpLocation>
    <sc:sparqlEndpointLocation>http://dbpedia.org/sparql</sc:sparqlEndpointLocation>
    <changefreq>monthly</changefreq>
  </sc:dataset>
</urlset>
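
A consumer could extract the dataset description along these lines. This is a sketch under assumptions: the slides do not show namespace declarations, so the `sc` namespace URI used here is assumed from the extension's home page and should be checked against the specification; `Dataset` and `parse_datasets` are hypothetical names, and only the first occurrence of each element is read.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    # Assumed namespace URI for the sc: prefix; verify against the spec.
    "sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd",
}

SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:linkedDataPrefix> http://dbpedia.org/resource/ </sc:linkedDataPrefix>
    <sc:dataDumpLocation> http://downloads.dbpedia.org/dump.nt.gz </sc:dataDumpLocation>
    <sc:sparqlEndpointLocation> http://dbpedia.org/sparql </sc:sparqlEndpointLocation>
    <changefreq>monthly</changefreq>
  </sc:dataset>
</urlset>
""".strip()


@dataclass
class Dataset:
    linked_data_prefix: Optional[str]
    data_dump_location: Optional[str]
    sparql_endpoint_location: Optional[str]
    changefreq: Optional[str]


def _text(parent: ET.Element, path: str) -> Optional[str]:
    """Text of the first matching child, whitespace-trimmed, or None."""
    value = parent.findtext(path, namespaces=NS)
    return value.strip() if value is not None else None


def parse_datasets(xml_text: str) -> List[Dataset]:
    """Collect one Dataset per <sc:dataset> element in the sitemap."""
    root = ET.fromstring(xml_text)
    return [
        Dataset(
            linked_data_prefix=_text(ds, "sc:linkedDataPrefix"),
            data_dump_location=_text(ds, "sc:dataDumpLocation"),
            sparql_endpoint_location=_text(ds, "sc:sparqlEndpointLocation"),
            changefreq=_text(ds, "sm:changefreq"),
        )
        for ds in root.findall("sc:dataset", NS)
    ]


print(parse_datasets(SAMPLE))
```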

SLIDE 18

More elements

  • sc:datasetLabel: Name for the dataset
  • sc:datasetURI: Hook for additional metadata
  • sc:authority: Hook for identifying the publisher
  • sc:sampleURI: Some representative URIs from the DS
SLIDE 19

Why XML?

  • Conservative webmasters
  • Simple
SLIDE 20

Sitemap discovery

http://domain/robots.txt:

User-agent: *
Disallow:
Sitemap: sitemap.xml

http://domain/sitemap.xml:

<urlset> ... </urlset>
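
The discovery step can be sketched as follows: read robots.txt, collect the `Sitemap:` lines, and turn them into absolute URLs. The Sitemap protocol expects a full URL in this field; because the slide shows a bare `sitemap.xml`, this sketch also resolves relative values against the robots.txt location, which is an assumption. `find_sitemaps` is a hypothetical helper name.

```python
from typing import List
from urllib.parse import urljoin

ROBOTS_TXT = """User-agent: *
Disallow:
Sitemap: sitemap.xml
"""


def find_sitemaps(robots_txt: str, robots_url: str) -> List[str]:
    """Return absolute sitemap URLs announced in a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        field, sep, value = line.partition(":")   # split at the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            # Relative values are resolved against the robots.txt URL.
            sitemaps.append(urljoin(robots_url, value.strip()))
    return sitemaps


print(find_sitemaps(ROBOTS_TXT, "http://domain/robots.txt"))
# ['http://domain/sitemap.xml']
```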

SLIDE 21
  • 1. Different access methods
  • Clients can choose between
  • sc:linkedDataPrefix
  • sc:dataDumpLocation
  • sc:sparqlEndpointLocation
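
How a client chooses between the three elements is left open by the slides. One possible policy, shown here purely as an assumption, is to prefer a dump, then a SPARQL endpoint, then crawling Linked Data; `choose_access` and the default ordering are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# Assumed preference of a bulk indexer: dump first, crawling last.
DEFAULT_PREFERENCE = (
    "dataDumpLocation",
    "sparqlEndpointLocation",
    "linkedDataPrefix",
)


def choose_access(
    dataset: Dict[str, str],
    preference: Sequence[str] = DEFAULT_PREFERENCE,
) -> Optional[Tuple[str, str]]:
    """Return (element name, location) for the first available method."""
    for element in preference:
        location = dataset.get(element)
        if location:
            return element, location
    return None


dbpedia = {
    "linkedDataPrefix": "http://dbpedia.org/resource/",
    "dataDumpLocation": "http://downloads.dbpedia.org/dump.nt.gz",
    "sparqlEndpointLocation": "http://dbpedia.org/sparql",
}
print(choose_access(dbpedia))
# ('dataDumpLocation', 'http://downloads.dbpedia.org/dump.nt.gz')
```

A client that only needs a handful of resources could pass a different `preference` order, for example with `linkedDataPrefix` first.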
SLIDE 22
  • 2. Crawl performance
  • Crawlers can discover and use RDF dump
  • Experiment: Downloading and slicing Uniprot takes ~25h and can be parallelized
  • Crawling Uniprot would take ~5 months
  • Bottleneck moves from retrieval to indexing
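
To put the two Uniprot figures side by side, the ratio can be computed directly. The 30-day month is an assumption, and both inputs are the approximate values quoted above.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day months


def speedup(crawl_months: float, dump_hours: float) -> float:
    """How many times faster the dump route is than crawling."""
    return crawl_months * HOURS_PER_MONTH / dump_hours


print(speedup(5, 25))  # 144.0
```

Under these assumptions the dump route is roughly two orders of magnitude faster, before any parallelization.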

SLIDE 23
  • 3. Provenance
  • Delegating and joining URI spaces with sc:subSitemap and sc:parentSitemap
  • Describing the publisher with sc:authority
  • URI space can be authoritatively served from a dump or SPARQL endpoint

SLIDE 24

Community and adoption

  • Most large LOD datasets have a sitemap
  • Supported by Sindice and SWSE
  • Publishers are receptive
  • They want a validator
  • public-lod@w3.org mailing list

SLIDE 25

Next steps

  • Updated draft
  • Sitemap creator + validator
  • Work on content descriptions (VOID)
SLIDE 26

Semantic Sitemaps …

  • … are a proposal for better RDF discovery
  • … allow publishers to announce their data
  • … allow consumers to efficiently find it
  • … have hooks for describing content and authority

SLIDE 27

http://sw.deri.org/2007/07/sitemapextension/

richard@cyganiak.de