CS6200: Information Retrieval
Slides by: Jesse Anderton
Crawling Structured Data
Crawling, session 10
In addition to unstructured document contents, a great deal of structured data exists on the web. We’ll focus here on two types:
properties of objects on their site
Sites that post articles, such as blogs or news sites, typically offer a listing of their new content in the form of a document feed. Several common feed formats exist. One of the most popular is RSS, which stands for (take your pick): Really Simple Syndication, RDF Site Summary, or Rich Site Summary.
http://www.cnn.com/services/rss/
RSS is an XML format for document listings. RSS files are obtained just like web pages, with HTTP GET requests. The ttl field gives the number of minutes the contents may be cached before the feed should be fetched again. This makes RSS feeds very useful for efficiently managing the freshness of news and blog content.
RSS Example
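The slide's original example is not reproduced here. As a stand-in, the following is a minimal sketch of how a crawler might read a feed and use ttl to schedule its next fetch. The feed contents, the example.com URLs, and the names parse_feed and next_fetch_time are all hypothetical, and the feed is an inline string rather than the result of a real HTTP GET.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# Hypothetical RSS 2.0 feed; a real crawler would obtain this text
# with an HTTP GET request against the feed URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed used for illustration.</description>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text, default_ttl_minutes=60):
    """Return (ttl_minutes, [(title, link), ...]) for an RSS 2.0 document.

    default_ttl_minutes is an assumed fallback for feeds without a ttl.
    """
    channel = ET.fromstring(xml_text).find("channel")
    ttl_text = (channel.findtext("ttl") or "").strip()
    ttl = int(ttl_text) if ttl_text.isdigit() else default_ttl_minutes
    items = [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.iter("item")
    ]
    return ttl, items


def next_fetch_time(fetched_at, ttl_minutes):
    """Earliest time the feed should be requested again."""
    return fetched_at + timedelta(minutes=ttl_minutes)


ttl, items = parse_feed(SAMPLE_RSS)
refetch_at = next_fetch_time(datetime(2014, 1, 6, 10, 0), ttl)
```

The item links are candidate URLs for the crawl frontier, and refetch_at tells the scheduler when the feed itself is worth requesting again.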
Many web pages are generated from structured data in databases, which can be useful for search engines and other tools.
Several schemas exist for web authors to publish their structured data for these tools. The WHATWG web specification working group has produced several standard formats for this data, such as microdata embedded in HTML.
Source: http://en.wikipedia.org/wiki/Microdata_(HTML)
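To illustrate how an extraction tool might read microdata, here is a minimal sketch using Python's standard html.parser module. The HTML snippet is hypothetical (it is not the example from the source above), and the FlatMicrodataParser class is an assumed helper that handles only a single, non-nested item. It takes text content as the property value, except for links, where it takes the href. A full microdata parser has to cover nested items and several other element types.

```python
from html.parser import HTMLParser

# Hypothetical page fragment annotated with microdata attributes.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Professor</span>
  <a itemprop="url" href="http://www.example.com/~jane">Home page</a>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collects itemprop values from a page with one non-nested item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._pending = None  # itemprop name still waiting for its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # itemscope is a boolean attribute, so its value is None.
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag == "a" and "href" in attrs:
            # For links, the property value is the URL, not the link text.
            self.properties[prop] = attrs["href"]
        else:
            self._pending = prop

    def handle_data(self, data):
        # The first non-blank text after an itemprop tag is its value.
        if self._pending and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None


parser = FlatMicrodataParser()
parser.feed(SAMPLE_HTML)
person = parser.properties
```

After parsing, parser.item_type holds the item's declared type and person maps each property name to its value. As the published data is not necessarily authoritative, a crawler would still need to validate it.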
The main web ontology is published at schema.org. These schemas are used to annotate web pages for automated information extraction tools. As the published information is not necessarily authoritative, the data needs to be carefully validated for quality and spam removal.
Popular schema.org entities
In addition to the obvious content for human readers, the web contains a great deal of structured content for use in automated systems.
published in a structured format. This can provide signals for relevance, and can also aid in reconstructing structured databases.