Crawling Structured Data: Crawling, session 10 (CS6200: Information Retrieval)

Dec 29, 2023

SLIDE 1

CS6200: Information Retrieval

Slides by: Jesse Anderton

Crawling Structured Data

Crawling, session 10

SLIDE 2

In addition to unstructured document contents, a great deal of structured data exists on the web. We’ll focus here on two types:

Document feeds, which sites use to announce their new content
Content metadata, used by web authors to publish structured properties of objects on their site

Structured Web Data

SLIDE 3

Sites which post articles, such as blogs or news sites, typically offer a listing of their new content in the form of a document feed. Several common feed formats exist. One of the most popular is RSS, which stands for (take your pick):

Rich Site Summary
Really Simple Syndication
RDF Site Summary
…?

Document Feeds

http://www.cnn.com/services/rss/

SLIDE 4

RSS is an XML format for document listings. RSS files are obtained just like web pages, with HTTP GET requests. The ttl field provides an amount of time (in minutes) that the contents should be cached. RSS feeds are very useful for efficiently managing freshness of news and blog content.

RSS Format

RSS Example
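
As a rough illustration of how a crawler might use the ttl field, the sketch below parses a small RSS document and computes the earliest time the feed should be fetched again. The feed contents, the example.com URLs, the function names, and the 60-minute fallback are all assumptions made for this example; a real crawler would obtain the XML with an HTTP GET request instead of using an inline string.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Hypothetical feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed used for illustration.</description>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text, default_ttl_minutes=60):
    """Return (ttl in minutes, list of item links) for an RSS 2.0 document."""
    channel = ET.fromstring(xml_text).find("channel")
    ttl_text = (channel.findtext("ttl") or "").strip()
    # Fall back to an assumed default when the feed omits ttl or it is malformed.
    ttl = int(ttl_text) if ttl_text.isdigit() else default_ttl_minutes
    links = [item.findtext("link") for item in channel.findall("item")]
    return ttl, links


def next_fetch_time(fetched_at, ttl_minutes):
    """Earliest time the cached copy of the feed should be refreshed."""
    return fetched_at + timedelta(minutes=ttl_minutes)


ttl, links = parse_feed(SAMPLE_RSS)
fetched_at = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
print(ttl, links, next_fetch_time(fetched_at, ttl))
```

The links found in the feed can be added to the crawl frontier, and the feed itself need not be requested again until the computed time has passed.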

SLIDE 5

Many web pages are generated from structured data in databases, which can be useful for search engines and other crawled document collections.

Several schemas exist for web authors to publish their structured data for these tools. The WHATWG web specification working group has produced several standard formats for this data, such as microdata embedded in HTML.

Structured Data

Source: http://en.wikipedia.org/wiki/Microdata_(HTML)
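
To make the idea of microdata embedded in HTML concrete, here is a minimal sketch of how a crawler could pull item properties out of a page using only Python's standard library. The HTML snippet, the class name MicrodataParser, and the output dictionary layout are assumptions for illustration. The parser is deliberately simplified: it handles flat items only (no nested itemscope elements) and reads a property's value from the element text, or from href on links.

```python
from html.parser import HTMLParser

# Hypothetical page fragment annotated with microdata attributes.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Professor</span>
  <a href="http://www.example.com/jane" itemprop="url">home page</a>
</div>
"""


class MicrodataParser(HTMLParser):
    """Collects flat microdata items; nested items are not supported."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # item currently being filled
        self._prop = None     # property whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self._current = {"type": attrs.get("itemtype"), "properties": {}}
            self.items.append(self._current)
        elif attrs.get("itemprop") and self._current is not None:
            name = attrs["itemprop"]
            if tag == "a" and "href" in attrs:
                # For links, the property value is the URL, not the link text.
                self._current["properties"][name] = attrs["href"]
            else:
                self._prop, self._text = name, []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None:
            value = "".join(self._text).strip()
            self._current["properties"][self._prop] = value
            self._prop = None


def extract_items(html):
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items


print(extract_items(SAMPLE_HTML))
```

A production extractor would also need to handle nested items, itemref, and the other value sources the microdata specification defines (for example content on meta elements and datetime on time elements).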

SLIDE 6

The main web ontology is published at schema.org. These schemas are used to annotate web pages for automated information extraction tools. Because the published information is not necessarily authoritative, crawlers must validate it for quality and filter out spam.

Web Ontologies

Popular schema.org entities
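
One simple form of validation is a sanity check on each extracted item before it is stored. The sketch below assumes items arrive as dictionaries with a "type" URL and a "properties" mapping, which is a hypothetical representation chosen for this example. The REQUIRED table is likewise this sketch's own assumption about which properties a consumer might insist on, not a rule taken from schema.org. Real quality and spam filtering would need much more than this, such as comparing the annotations against the visible page content.

```python
from urllib.parse import urlparse

# Hypothetical per-type requirements, chosen only for illustration.
REQUIRED = {
    "Person": {"name"},
    "Event": {"name", "startDate"},
}


def validate_item(item):
    """Return a list of problems found in one extracted item (empty if none)."""
    problems = []
    parsed = urlparse(item.get("type") or "")
    if parsed.netloc != "schema.org":
        problems.append("type is not a schema.org URL")
        return problems
    type_name = parsed.path.strip("/")
    props = item.get("properties", {})
    for name in sorted(REQUIRED.get(type_name, set())):
        # Treat missing, empty, and whitespace-only values as absent.
        if not str(props.get(name, "")).strip():
            problems.append(f"missing property: {name}")
    return problems


event = {"type": "https://schema.org/Event", "properties": {"name": "Open House"}}
print(validate_item(event))
```

Items that fail such checks can be dropped or down-weighted rather than trusted as authoritative.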

SLIDE 7

In addition to the obvious content for human readers, the web contains a great deal of structured content for use in automated systems.

Document feeds are an important way to manage freshness at some of the most frequently updated web sites.
Much of the structured data owned by various web entities is published in a structured format. This can provide signals for relevance, and can also aid in reconstructing structured databases.