Sindice now supports efficient data discovery and sync

So far, semantic web search engines and semantic aggregation services have either inserted datasets by hand or relied on “random walk”-style crawls, with no data completeness or freshness guarantees.

After quite some work, we are happy to announce that Sindice now supports effective large-scale data acquisition with *efficient syncing* capabilities, based on already existing standards (a specific use of the sitemap protocol).

For example, if you publish 300,000 products using RDFa or whatever else you want to use (microformats, 303 redirects, etc.), then by making sure you comply with the proposed method, Sindice now guarantees:

a) to crawl your dataset completely (this might take some time, since we do it “politely”);

b) to crawl you only once, and from then on fetch just the updated URLs on a daily basis (a timely data-update guarantee; see the sketch just below).
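
To make this concrete, here is a minimal sketch, in Python, of generating the kind of sitemap with per-URL “last modified” dates that the proposed method builds on (the example.com product URLs and dates are hypothetical; see the spec linked below for the authoritative details):

    # Sketch: publish a sitemap.xml with <lastmod> per URL, so that
    # consumers like Sindice can later fetch only what has changed.
    import xml.etree.ElementTree as ET

    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"  # standard sitemap namespace

    def build_sitemap(pages):
        """pages: iterable of (url, last_modified_date) pairs."""
        urlset = ET.Element("urlset", xmlns=NS)
        for url, lastmod in pages:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
            ET.SubElement(entry, "lastmod").text = lastmod      # W3C date, e.g. "2009-06-15"
            ET.SubElement(entry, "changefreq").text = "daily"   # hint for consumers
        return ET.ElementTree(urlset)

    # Hypothetical product pages annotated with RDFa:
    pages = [("http://example.com/products/%d" % i, "2009-06-15")
             for i in range(1, 4)]
    build_sitemap(pages).write("sitemap.xml", encoding="utf-8",
                               xml_declaration=True)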

So this is not “crawling” anymore, but rather a live, “DB-like” connection between remote, diverse datasets, all based on HTTP. In our opinion this is a *very* important step forward for semantic web data aggregation infrastructures.

The specification we support (and how to make sure you’re being properly indexed) is published here (pretty simple stuff, actually!):

http://sindice.com/developers/publishing

Results can already be seen from websites that implement it (indeed, you might already be doing so without knowing).

Why not make sure that your site can be effectively kept in sync today?

As always, we look forward to comments, suggestions, and ideas on how to better serve your data needs (and yes, we’ll also support the OpenLink dataset sync proposal once its specs are finalized). Feel free to ask specific questions about this or any other Sindice-related issue on our dev forum: http://sindice.com/main/forum

Giovanni,

on behalf of the team (http://sindice.com/main/about). Special credit for this goes to Tamas Benko and Robert Fuller.

P.S. We’re still interested in hiring selected researchers and developers.


One comment

  1. Comment by Giovanni Tummarello  

    I have had a few questions, so I’ll post them here as well.
    “What, more precisely, is the difference with crawling?”

    Crawling (as opposed to fetching a list) implies following links. Here, on the other hand, no <a> links are ever followed. Sindice data synchronization is stateful per site; here is how it works (a rough sketch of step b2 follows the list):

    a) a discovery phase, where a site is analyzed to see whether it supports the efficient sync, and at which level (e.g. sitemaps can declare an “update frequency”, and it works either way);
    b) a stateful synchronization process:
    b1) first the site is acquired entirely, politely, by fetching the URLs listed in its sitemaps;
    b2) once everything has been fetched, only the sitemap is refetched every 24h, and only the URLs whose “last modified” is later than our last fetch are downloaded (plus those marked with a change frequency);
    c) RSS and Atom feeds are also used, if listed in the sitemap.
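
    As a rough sketch of step b2 (an illustration only, not the actual Sindice code; the last_sync value and the fetch_and_index step are hypothetical):

        # Sketch: given a site's sitemap and the time we last synced it,
        # return only the URLs that need refetching.
        import urllib.request
        import xml.etree.ElementTree as ET
        from datetime import datetime

        NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

        def urls_to_refetch(sitemap_url, last_sync):
            with urllib.request.urlopen(sitemap_url) as resp:
                root = ET.parse(resp).getroot()
            changed = []
            for url in root.iter(NS + "url"):
                loc = url.findtext(NS + "loc")
                lastmod = url.findtext(NS + "lastmod")
                # No lastmod at all? Be conservative and refetch.
                # lastmod[:10] keeps the date part of a full W3C datetime.
                if lastmod is None or \
                   datetime.strptime(lastmod[:10], "%Y-%m-%d") > last_sync:
                    changed.append(loc)
            return changed

        # Hypothetical daily run:
        # for u in urls_to_refetch("http://example.com/sitemap.xml",
        #                          datetime(2009, 6, 14)):
        #     fetch_and_index(u)  # hypothetical downstream step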

    This feat is technically demanding for several reasons: keeping the overall processing “sustainable” makes the operations highly parallel, and the variety of forms and sizes sitemaps can take requires handling extreme cases (hundreds of megabytes of nested gzipped files) automatically and in parallel, etc.
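
    For instance, a sitemap that large cannot simply be loaded into memory; a minimal sketch of the kind of streaming parse this calls for (assuming an already downloaded sitemap.xml.gz) is:

        # Sketch: stream-parse a huge gzipped sitemap without ever
        # building the whole XML tree in memory.
        import gzip
        import xml.etree.ElementTree as ET

        NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

        with gzip.open("sitemap.xml.gz", "rb") as f:
            for event, elem in ET.iterparse(f):  # default: "end" events
                if elem.tag == NS + "url":
                    print(elem.findtext(NS + "loc"))
                    elem.clear()  # free the element; keeps memory flat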

    Note that there is still some crawling, but it is restricted to discovering new sites (and therefore stops after a few pages) and to crawling pure RDF documents.

    The specs are also a good source of further information on what to expect and how to do things such as having your site reindexed from scratch.
