Sindice reindexed: find your datasets (much faster)

Now that we have streamlined several procedures inside Sindice, rebuilding the Sindice index from scratch takes just a few hours.

Over the weekend, we built a new Sindice index based on the latest updates to SIREn and improvements to the pipelines. It is now in production and sports the following enhancements:

Ranking

  • Big documents are no longer ranked first (sorry guys, this was an issue in recent weeks).
  • Properties are weighted differently.

Preprocessing

  • Improved support for encoded URIs: Encoded characters in URIs are now decoded prior to indexing, and both versions of the URI, decoded and encoded, are indexed. As a consequence, if you search for a URI (or keyword) containing special characters (e.g., Knud_Möller), it will also match encoded URIs (e.g., http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller).
  • Improved tokenisation of URI local names: Previously a URI local name was tokenised, but the non-tokenised version of the local name was not indexed. For example, given the URI http://rdf.data-vocabulary.org/#startDate, we now index as part of the local name: start, date, and startDate (the latter was not possible previously).
  • Improved support for mailto URIs: Previously the URI mailto:test@test.com was improperly tokenised and indexed. Now you can search for either mailto:test@test.com or test@test.com. Both will match the mailto URI.
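
To make the three preprocessing changes above concrete, here is a minimal Python sketch of the idea. It is not SIREn's actual analysis code; the function names (uri_variants, localname_tokens, mailto_terms) and the camelCase splitting rule are assumptions made purely for illustration.

```python
import re
from urllib.parse import unquote


def uri_variants(uri):
    """Return the URI as given, plus its percent-decoded form when it differs."""
    decoded = unquote(uri)
    return [uri] if decoded == uri else [uri, decoded]


def localname_tokens(uri):
    """Split a camelCase local name into lowercase parts and keep the whole name."""
    # Assumption: the local name is whatever follows the last '#' or '/'.
    name = re.split(r"[#/]", uri)[-1]
    parts = [p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)]
    # Index the untokenised local name as well, so "startDate" matches as typed.
    return parts + [name]


def mailto_terms(uri):
    """Return both the full mailto URI and the bare address as index terms."""
    prefix = "mailto:"
    if uri.lower().startswith(prefix):
        return [uri, uri[len(prefix):]]
    return [uri]


if __name__ == "__main__":
    print(uri_variants("http://dblp.l3s.de/d2r/resource/authors/Knud_M%C3%B6ller"))
    print(localname_tokens("http://rdf.data-vocabulary.org/#startDate"))
    print(mailto_terms("mailto:test@test.com"))
```

In each case the point is the same: several forms of one URI are emitted as index terms, so a query written in any of those forms finds the document.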

Query Language and processing

  • Improved support for special characters in URIs. For example, a tilde ‘~’ in a URI previously invalidated the query.
  • Various bug fixes as reported by users.
  • Improved query processing for Ntriple queries, with huge performance benefits in certain situations.

Index Data Structure

  • Improved index compression.

Those are the technical details.

In practice, the thing I love the most is that my favorite queries (the group-by-dataset ones, i.e. the queries for finding open web datasets) now easily run 10 times faster :).

As always, follow the development of SIREn, our core open source Semantic Information Retrieval Engine, on GitHub:

Post filed under Announcements, Sindice.
