crawling structured data

Crawling Structured Data: Crawling, session 10, CS6200: Information Retrieval



  1. Crawling Structured Data. Crawling, session 10. CS6200: Information Retrieval. Slides by: Jesse Anderton

  2. Structured Web Data: In addition to unstructured document contents, a great deal of structured data exists on the web. We’ll focus here on two types:
     • Document feeds, which sites use to announce their new content
     • Content metadata, used by web authors to publish structured properties of objects on their site

  3. Document Feeds: Sites which post articles, such as blogs or news sites, typically offer a listing of their new content in the form of a document feed (for example, http://www.cnn.com/services/rss/). Several common feed formats exist. One of the most popular is RSS, which stands for (take your pick):
     • Rich Site Summary
     • Really Simple Syndication
     • RDF Site Summary
     • …?

  4. RSS Format: RSS is an XML format for document listings. RSS files are obtained just like web pages, with HTTP GET requests. The ttl field gives the length of time (in minutes) for which the contents may be cached before they are fetched again. RSS feeds are very useful for efficiently managing freshness of news and blog content. [Figure: RSS Example]
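To make the RSS format concrete, here is a minimal sketch of how a crawler might read a feed once it has been fetched. It assumes an RSS 2.0 document and uses only Python's standard library. The sample feed, the function names `parse_rss` and `cache_expiry`, and the 60-minute fallback are all hypothetical choices for illustration, not part of the original slides.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# Hypothetical sample feed, made up for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed for illustration</description>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text, default_ttl_minutes=60):
    """Return (ttl_minutes, [(title, link), ...]) for an RSS 2.0 document.

    default_ttl_minutes is an assumed fallback used when the feed has no
    usable <ttl> element.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("no <channel> element found")

    ttl_text = channel.findtext("ttl")
    try:
        ttl = int(ttl_text) if ttl_text is not None else default_ttl_minutes
    except ValueError:
        ttl = default_ttl_minutes

    items = [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in channel.findall("item")
    ]
    return ttl, items


def cache_expiry(fetched_at, ttl_minutes):
    """Earliest time at which the feed should be fetched again."""
    return fetched_at + timedelta(minutes=ttl_minutes)


if __name__ == "__main__":
    ttl, items = parse_rss(SAMPLE_RSS)
    print("refetch after", cache_expiry(datetime.now(), ttl))
    for title, link in items:
        print(title, link)
```

In a real crawler the XML text would come from an HTTP GET request rather than a string constant, and the links from each item would be added to the crawl frontier.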

  5. Structured Data: Many web pages are generated from structured data in databases, which can be useful for search engines and other crawled document collections. Several schemas exist for web authors to publish their structured data for these tools. The WHATWG web specification working group has produced several standard formats for this data, such as microdata embedded in HTML. Source: http://en.wikipedia.org/wiki/Microdata_(HTML)
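As an illustration of microdata embedded in HTML, the sketch below pulls `itemprop` values out of a page using Python's standard `html.parser` module. It is deliberately simplified: it assumes flat (non-nested) items and attributes each property to the most recently opened `itemscope` element, so it is not a full implementation of the microdata extraction rules. The sample markup, the class name `FlatMicrodataParser`, and the helper `extract_microdata` are hypothetical.

```python
from html.parser import HTMLParser

# Elements whose property value comes from an attribute rather than from
# their text content (only a few common cases are covered here).
VALUE_ATTRS = {"meta": "content", "a": "href", "link": "href",
               "img": "src", "time": "datetime"}
VOID_TAGS = {"meta", "link", "img"}  # these never have text content

# Hypothetical sample page fragment, made up for illustration only.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Example</span>
  <span itemprop="jobTitle">Research assistant</span>
  <a itemprop="url" href="http://www.example.com/jane">Home page</a>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collect itemprop values, assuming items are not nested."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._prop = None   # property whose text is currently being read
        self._tag = None    # tag that opened that property
        self._depth = 0     # nesting depth of that same tag
        self._text = []

    def _store(self, name, value):
        self.items[-1]["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._prop is not None:
            # Inside a text-valued property: only track same-tag nesting.
            if tag == self._tag:
                self._depth += 1
            return
        if "itemscope" in attrs:
            self.items.append({"type": attrs.get("itemtype"), "properties": {}})
            return
        name = attrs.get("itemprop")
        if name is None or not self.items:
            return
        value_attr = VALUE_ATTRS.get(tag)
        if value_attr is not None and value_attr in attrs:
            self._store(name, attrs[value_attr])
        elif tag not in VOID_TAGS:
            self._prop, self._tag, self._depth, self._text = name, tag, 1, []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._store(self._prop, "".join(self._text).strip())
                self._prop = None


def extract_microdata(html_text):
    parser = FlatMicrodataParser()
    parser.feed(html_text)
    parser.close()
    return parser.items


if __name__ == "__main__":
    for item in extract_microdata(SAMPLE_HTML):
        print(item["type"], item["properties"])
```

Each extracted item is a dictionary holding the item type and a mapping from property name to a list of values, which is one convenient shape for indexing or for rebuilding database-like records.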

  6. Web Ontologies: The main web ontology is published at schema.org. These schemas are used to annotate web pages for automated information extraction tools. As the published information is not necessarily authoritative, the data needs to be carefully validated for quality and spam removal. [Figure: Popular schema.org entities]
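The validation step can be sketched as a set of simple sanity checks run on each extracted item. Everything in this example is an assumption made for illustration: the item layout (a dictionary with `type` and `properties`), the table of required properties, and the length cutoff are hypothetical and are not taken from schema.org or from the slides. A production system would use much richer quality and spam signals.

```python
# Hypothetical rules chosen for illustration; not taken from schema.org.
REQUIRED_PROPERTIES = {
    "http://schema.org/Person": {"name"},
    "http://schema.org/Product": {"name"},
}
MAX_VALUE_LENGTH = 500  # assumed cutoff for keyword-stuffed values


def validate_item(item):
    """Return a list of problems found in one extracted item.

    `item` is assumed to look like
    {"type": <type URL>, "properties": {<name>: [<value>, ...]}}.
    An empty list means the item passed these basic checks.
    """
    problems = []
    item_type = item.get("type")
    properties = item.get("properties", {})

    if item_type not in REQUIRED_PROPERTIES:
        problems.append("unrecognized type: %r" % (item_type,))
    else:
        missing = REQUIRED_PROPERTIES[item_type] - set(properties)
        for name in sorted(missing):
            problems.append("missing required property: " + name)

    for name, values in properties.items():
        for value in values:
            if len(value) > MAX_VALUE_LENGTH:
                problems.append("suspiciously long value for: " + name)
    return problems


if __name__ == "__main__":
    item = {"type": "http://schema.org/Person",
            "properties": {"jobTitle": ["Research assistant"]}}
    print(validate_item(item))
```

Items that fail such checks could be dropped or down-weighted rather than trusted as authoritative.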

  7. Wrapping Up: In addition to the obvious content for human readers, the web contains a great deal of structured content for use in automated systems.
     • Document feeds are an important way to manage freshness at some of the most frequently updated web sites.
     • Much of the structured data owned by various web entities is published in a structured format. This can provide signals for relevance, and can also aid in reconstructing structured databases.
