Metadata extraction

Status

  • [2011-11-10] Major improvements to the document. Added reference to Any23 external documentation.
    Migrated from Microformats to Metadata document topic.
  • [2008-04-21] Added support for rel-license, hListing, EXFN. Changed archive format to accommodate new spec.
  • [2008-07-28] General improvements to the document.
  • [2008-08-10] First public release.

Overview

This section describes how Sindice extracts metadata from the Web and which formats are supported.

Metadata parsed from HTTP resources (HTML, RDF-specific formats, CSV, etc.) is converted to RDF graphs.
These graphs are enhanced, stored and indexed by several Sindice backend components,
which make them available for querying through Sindice services such as SIREn, the SPARQL endpoint and the Sindice Cache API.
See the full list of the Sindice open APIs and tools.
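
As an illustration of what converting a resource to an RDF graph can look like, the following minimal Python sketch turns CSV text into N-Triples statements. It is not the Sindice or Any23 implementation: the function name csv_to_ntriples, the one-subject-per-row and one-predicate-per-column mapping, and the example.org URIs are assumptions made for this example only.

```python
import csv
import io
from urllib.parse import quote


def csv_to_ntriples(text, base_uri):
    """Map CSV text to N-Triples lines (hypothetical mapping, not Any23's).

    Each data row becomes one subject; each column header becomes a predicate
    and each non-empty cell becomes a plain string literal.
    """
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for index, row in enumerate(reader):
        subject = f"<{base_uri}row/{index}>"
        for column, value in row.items():
            # Skip surplus cells (column is None) and empty cells.
            if column is None or not value:
                continue
            predicate = f"<{base_uri}{quote(column.strip())}>"
            # Escape the characters that are special inside an N-Triples literal.
            literal = (
                value.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


if __name__ == "__main__":
    sample = "name,city\nAlice,Galway\n"
    for line in csv_to_ntriples(sample, "http://example.org/"):
        print(line)
```

Each emitted line is one statement of the resulting graph; a real extractor would also attach datatypes and provenance information, which this sketch leaves out.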

Processing metadata with Any23

The core extraction library used by Sindice is Anything To Triples (Any23); Sindice runs the latest stable release.

Any23 is a library, a Web service and a set of command-line tools written in Java for extracting structured data in RDF format from a variety of Web documents.
Any23 will be maintained in the Google Code repository until the 0.7.0 release, after which the development infrastructure will be migrated to the Apache Any23 site.
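
To give a feel for this kind of extraction, the sketch below pulls rel-license links out of an HTML document and emits them as N-Triples. It uses only the Python standard library and does not call Any23, which is a Java library. The names RelLicenseParser and extract_license_triples are hypothetical, and the use of the XHTML vocabulary license predicate is an assumption for this example rather than a statement of what Any23 outputs.

```python
from html.parser import HTMLParser

# Assumed predicate for this sketch: the XHTML vocabulary "license" term.
XHTML_LICENSE = "http://www.w3.org/1999/xhtml/vocab#license"


class RelLicenseParser(HTMLParser):
    """Collect the href of every <a> or <link> element carrying rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_values and href:
            self.licenses.append(href)


def extract_license_triples(html, document_uri):
    """Return one N-Triples line per rel-license link found in the HTML text."""
    parser = RelLicenseParser()
    parser.feed(html)
    parser.close()
    return [
        f"<{document_uri}> <{XHTML_LICENSE}> <{href}> ."
        for href in parser.licenses
    ]


if __name__ == "__main__":
    page = '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>'
    for line in extract_license_triples(page, "http://example.org/page"):
        print(line)
```

The sketch does not resolve relative href values against the document URI; a production extractor would need to do so before emitting the triple.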

Any23 Extractors

Any23 supports multiple input and output metadata formats.
See all the supported Any23 formats.

Any23 Microformats Nesting

Any23 adds specific structural statements to express the nesting relationship of Microformat metadata.
The logic of these statements is described in the Microformat Nesting documentation section.

Verify metadata with Sindice Inspector

The main purpose of the Sindice Inspector is to provide a verification and visualization tool for Semantic Web metadata.
It also supports web engineers interested in evaluating third-party metadata, which can then be pinged to Sindice for use in data mashup scenarios.

The Sindice Inspector uses Any23 to extract metadata, so this versatile tool can be used to verify how Sindice reads any data exposed on the Web.
A live demo of the Any23 Web Service on its own is available here.