Publishing Web Data

How to Publish Web Data for Effective Discovery and Synchronization

Publishing RDF or RDFa has limited value unless your data can be easily discovered and efficiently kept in sync by remote databases. This article by the Sindice Team describes the practical steps content providers can take to begin publishing effectively into the web of data.

  1. Publish data as machine-readable pages
  2. Enable effective Discovery and Synchronization
  3. Tell Sindice about your site
  4. Check that your site has been Discovered and Synchronized

1. Publish your data using RDFa, RDF or Microformats 

There are several options for publishing data using semantic standards. For many website owners, the easiest is to embed the extra markup directly into the HTML pages using RDFa or Microformats. In this way the same page can be read by people and parsed by computers; for most publishers and content management systems this simply means adding the markup support into the current HTML templates.

An alternative to embedding semantic data into your pages is to publish the content in two formats: one for people, and a different one for computers. This is done by supporting content negotiation as described in the tutorial How to publish Linked Data on the Web. In supporting semantic content negotiation, a web server which normally provides HTML sends the RDF format instead when the HTTP Accept request header contains 'application/rdf+xml'. Sindice uses an Accept header similar to application/rdf+xml, text/turtle, text/n3, application/xhtml+xml; q=0.9, text/html; q=0.8, text/plain; q=0.6, */*. Content negotiation allows people to view the standard HTML pages, while semantic engines like Sindice receive the corresponding semantic data directly.
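As a rough illustration of the server side of this negotiation, the Python sketch below parses an Accept header's q-values and picks the best representation the server can produce. The function name and the simplified matching (exact types and */* only, none of the fuller RFC 7231 rules) are our own illustration, not part of any Sindice API.

```python
def choose_representation(accept_header, available):
    """Pick the best available media type for an HTTP Accept header.

    accept_header: e.g. "application/rdf+xml, text/html; q=0.8"
    available:     media types the server can produce.
    Simplified sketch: exact matches and "*/*" only.
    """
    prefs = []
    for part in accept_header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media, q = pieces[0], 1.0  # q defaults to 1.0 when absent
        for param in pieces[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((media, q))
    best, best_q = None, 0.0
    for media in available:
        for pattern, q in prefs:
            if pattern in (media, "*/*") and q > best_q:
                best, best_q = media, q
    return best
```

With the header "application/rdf+xml, text/html; q=0.8" and both representations available, the function selects application/rdf+xml, which is exactly the behaviour a semantic crawler relies on.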

Content providers who wish simply to provide a large RDF dataset dump can do so by following the procedures defined for the Semantic Sitemap Extension. While this format is less useful for frequently changing data, it is sometimes an appropriate choice for large datasets.

Sindice, the semantic web index at www.sindice.com supports all of the above methods. 

For help in publishing great data, be sure to visit the Sindice Web Data Inspector. The Web Data Inspector will assist you by providing interactive data visualization and validation services.

2. Enable effective Discovery and Synchronization

You are now ready to enable your website for effective discovery and synchronization. Your goal is to allow Sindice and other engines to discover what is new or recently changed on your site in an efficient and timely manner.

If you have exposed your semantic data by embedding it into your web pages using RDFa or Microformats, or by supporting content negotiation, the best ways to let other systems know about your changes are to provide a sitemap with time indications, or to provide RSS or Atom feeds. See sections 2.1 and 2.2 below.

If you have exposed your semantic data by RDF dumps, publish your dataset using the semantic sitemap extension. See section 2.3 below.

You can also send notification of changes directly to Sindice using our PING interface. See section 2.4 below.

Whether you publish your semantic content using a standard sitemap, an RSS or Atom feed, or a semantic sitemap, be sure to list the sitemap in your robots.txt file. A site's robots.txt is the first place Sindice and many other engines will look when trying to discover your site. Add a line like the following into your robots.txt file:

Sitemap: http://www.example.com/sitemap.xml
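To see what a discovery engine does with this file, here is a small Python sketch (our own illustration, not Sindice code) that extracts the Sitemap entries from a robots.txt body. The directive name is case-insensitive and may appear several times:

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs listed on Sitemap: lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so URLs (which contain ":") survive.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls
```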

2.1 Include time indications in your Sitemap

Sitemaps are the standard way to let crawlers know about the pages on your website. When sitemaps provide time indications using lastmod, changefreq and priority fields, they can be used effectively to have Sindice and others download only new and changed pages.

Sitemaps are usually named "sitemap.xml" and should be advertised in the server's robots.txt file. The sitemap lists the URLs of your pages which either contain embedded metadata or are supported by content negotiation. Since the sitemaps will also be used by conventional HTML crawlers, they should not include direct links to pure RDF representations of your pages.

A sample sitemap can be seen below.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
      <priority>0.3</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
      <lastmod>2004-11-23</lastmod>
   </url>
</urlset>

The sitemap contains a set of URLs, and for each URL three pieces of information can be provided: the date and time of the last modification, the change frequency and the priority.

  • Sindice uses the lastmod field to decide if the given URL has to be re-indexed or not. This is probably the most important information, as it can reduce the number of requests Sindice will make to your site. If you provide lastmod for every listed URL, you can expect Sindice to fetch only the sitemap.xml file (e.g. daily) and then fetch only those URLs which have been modified.
  • The changefreq field is used to determine how often the sitemap containing the given URL needs to be fetched. The sitemap will be fetched with the highest frequency indicated by the URLs it contains; because of this, it can save bandwidth to group items with the same change frequency into separate sitemaps. When lastmod is not available, changefreq is also used to decide whether a page should be re-fetched.
  • The priority field indicates the priority of this URL relative to other URLs on your site. This is a number between 0 and 1, and it helps Sindice decide which pages to fetch when it does not have enough resources to fetch them all. The default priority is 0.5, and higher-priority pages will get more attention from Sindice (and other indexing engines).
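The lastmod-driven behaviour described above can be sketched as follows. This is an illustrative Python fragment, not Sindice's actual crawler: it parses a sitemap and returns only the URLs modified on or after a given date, keeping URLs without a lastmod since their state is unknown.

```python
import xml.etree.ElementTree as ET
from datetime import date

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def changed_urls(sitemap_xml, since):
    """Return URLs with lastmod on or after `since`, plus URLs lacking lastmod."""
    out = []
    for url in ET.fromstring(sitemap_xml).findall(SM + "url"):
        lastmod = url.findtext(SM + "lastmod")
        # lastmod may be a plain date or a full W3C datetime; compare the date part.
        if lastmod is None or date.fromisoformat(lastmod[:10]) >= since:
            out.append(url.findtext(SM + "loc"))
    return out
```

A crawler running this daily against your sitemap would re-fetch only the pages you have actually changed.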

Sindice supports sitemap indexes as well as compressed sitemaps. Using compression is strongly suggested for large sitemaps. Refer to the sitemap standard at www.sitemaps.org for full details.

2.2 Provide RSS or Atom feeds

RSS and Atom web feeds are a standard way of publishing the list of pages recently changed on your site. If you have a web feed which points to content containing embedded metadata or supporting content negotiation, Sindice can follow that feed with no further changes required. To have Sindice follow your RSS feeds, list them as sitemaps in your robots.txt file. Sindice will fetch such RSS feeds as often as hourly, and will index the new pages listed there.

Sitemap: http://www.example.com/mysite.rss
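The same incremental logic applies to feeds: an engine reads the feed and fetches only the entries it has not yet seen. A minimal Python sketch (our own illustration, for a basic RSS 2.0 feed without namespaces) that lists the item links and publication dates:

```python
import xml.etree.ElementTree as ET

def feed_items(rss_xml):
    """Return (link, pubDate) pairs from a basic RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("link"), item.findtext("pubDate"))
            for item in root.iter("item")]
```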

2.3 Using Semantic sitemaps

A semantic sitemap is a sitemap extension to be used when your website provides dataset dumps (e.g. the content as a single RDF file) or exposes further semantic services such as SPARQL endpoints. To have Sindice index your whole dataset, this is the best option. Semantic sitemaps should be advertised in the server's robots.txt file exactly as plain sitemaps are.

A simple example is the following.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Example Corp. Product Catalog</sc:datasetLabel>
    <sc:datasetURI>http://example.com/catalog.rdf#catalog</sc:datasetURI>
    <sc:linkedDataPrefix slicing="subject-object">http://example.com/products/</sc:linkedDataPrefix>
    <sc:sampleURI>http://example.com/products/widgets/X42</sc:sampleURI>
    <sc:sampleURI>http://example.com/products/categories/all</sc:sampleURI>
    <sc:sparqlEndpointLocation slicing="subject-object">http://example.com/sparql</sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>http://example.com/data/catalogdump.rdf.gz</sc:dataDumpLocation>
    <sc:dataDumpLocation>http://example.org/data/catalog_archive.rdf.gz</sc:dataDumpLocation>
    <sc:dataDumpLocation>http://example.org/data/product_categories.rdf.gz</sc:dataDumpLocation>
    <lastmod>2004-12-23T18:00:15+00:00</lastmod>
    <changefreq>weekly</changefreq>
  </sc:dataset>
</urlset>


If provided with a semantic sitemap, Sindice will reload and index the dumps when the <lastmod> element indicates a change or, if that element is not present, at the frequency indicated by the <changefreq> element.

Semantic sitemaps are an efficient way to index large datasets because they require no crawling of your site. However, since they do not support incremental/differential updates, we do not recommend semantic sitemaps for large datasets subject to frequent change. Instead consider providing a standard sitemap with time indications as described earlier.
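On the consumer side, a compressed dump such as catalogdump.rdf.gz is simply fetched and decompressed before parsing. A minimal Python sketch (a real consumer would stream the file into an RDF parser rather than read it whole into memory):

```python
import gzip

def read_dump(path):
    """Read a gzipped RDF dump file into a string."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()
```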

2.4 Sending automatic notification to Sindice

To have individual pages from your site quickly updated in Sindice, your site can send automatic notification using the Sindice Ping API. This is as simple as sending an HTTP POST request with the URL of the page to be indexed.

curl -H "Accept: text/plain" --data-binary 'http://www.example.com/mypage.html' http://sindice.com/api/v2/ping
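The same ping can be sent from code. The sketch below builds the request with Python's standard library; the endpoint URL and headers mirror the curl command above, and the actual send is left commented out to avoid a live network call.

```python
import urllib.request

PING_ENDPOINT = "http://sindice.com/api/v2/ping"

def build_ping_request(page_url):
    """Build the HTTP POST request that notifies Sindice of a changed page."""
    return urllib.request.Request(
        PING_ENDPOINT,
        data=page_url.encode("utf-8"),   # the body is simply the page URL
        headers={"Accept": "text/plain"},
        method="POST",
    )

# To actually send it:
# with urllib.request.urlopen(build_ping_request("http://www.example.com/mypage.html")) as r:
#     print(r.read().decode())
```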

Sindice pings are handled with high priority, allowing your pages to be refreshed in our index in as little as 20 minutes. Our fetcher will respect politeness when downloading the pages from your site. Please also be polite if submitting pages automatically to our Ping API.

Visit the documentation page for full details on using the Sindice Ping API.

3. Tell Sindice about your site

Once you are ready with any of the above methods, simply use the online form to send a link to your sitemap, or to ping us any page containing embedded metadata or supporting content negotiation.

Once Sindice has discovered your sitemap we will do a sample analysis of your site (up to 1000 pages) to determine whether you are publishing effectively into the semantic web of data. If you are, Sindice will begin the synchronization process. 

4. Check that your site has been Discovered and Synchronized

A day or even a few hours after submitting your site, you can search Sindice to find exactly how many of your pages have been indexed. Substitute your domain into the following searches to find which pages have been indexed today or over the past week:

http://sindice.com/search?q=date:today+domain:www.example.com

http://sindice.com/search?q=date:last_week+domain:www.example.com

Please note that if your sitemap contains a large number of URLs, it will take time for Sindice to download them all due to crawler politeness policies, which usually limit fetching to a sustainable rate (the default is one document per second, but this can be modified with the Crawl-delay directive in your robots.txt file).
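For example, a robots.txt that both advertises the sitemap and asks crawlers to slow to one request every two seconds might look like this (the two-second value is purely illustrative; pick a rate your server can sustain):

```
User-agent: *
Crawl-delay: 2
Sitemap: http://www.example.com/sitemap.xml
```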

To check that you are effectively synchronized, modify one of your records and recreate your sitemap, updating the lastmod element for that URL. Sindice should have that page re-indexed within the next 24 hours.

If you have comments, questions or encounter problems publishing into the web of data for effective discovery and synchronization, we'd love to hear from you at the Sindice Developers discussion group. Find us at http://groups.google.com/group/sindice-dev/topics.