SPARQL endpoint for SUNCAT

As we explored how to extend access to the metadata contributed by a set of libraries to the SUNCAT service, in order to promote discovery and reuse of the data, it soon became clear that Linked Data was one of the preferred formats for enabling this.

The previous phase of this project developed a transformation to express the information on holdings in an RDF model. The resulting XSLT converts MARC-XML into RDF/XML. It was used to process over 1,000,000 holdings records made available by the British Library, the National Library of Scotland, the University of Bristol Library, the University of Nottingham Library, the University of Glasgow Library and the library of the Society of Antiquaries of London, in order to make them available through a Linked Data SPARQL endpoint.
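
To make the shape of that transformation more concrete, the Python sketch below reads one MARC-XML record and emits triples using the property names that appear in the sample query later in this post (dc:title, lh:held, lh:holder). It is not the project's XSLT. The choice of MARC fields (245 $a for the title, 852 $a for the holding library), the blank-node labels and the sample record are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Standard MARC-XML namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical record, for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Example journal title</subfield>
  </datafield>
  <datafield tag="852" ind1=" " ind2=" ">
    <subfield code="a">Example Library</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return the text of every subfield `code` in every datafield `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text.strip() for el in record.findall(path, MARC_NS) if el.text]


def record_to_triples(xml_text, record_id="rec1"):
    """Map one MARC-XML record to (subject, predicate, object) tuples."""
    record = ET.fromstring(xml_text)
    topic = f"_:{record_id}"  # assumed blank-node label for the journal
    triples = []
    for title in subfields(record, "245", "a"):
        triples.append((topic, "dc:title", title))
    for n, holder in enumerate(subfields(record, "852", "a"), start=1):
        holding = f"_:{record_id}_holding{n}"
        triples.append((topic, "lh:held", holding))
        triples.append((holding, "lh:holder", holder))
    return triples


if __name__ == "__main__":
    for triple in record_to_triples(SAMPLE_RECORD):
        print(triple)
```

Each holding gets its own node, so that a title held by several libraries yields one lh:held link per holding.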

Setting up the Triplestore

We built on previous experience at EDINA of providing SPARQL endpoints to set up the interface for the SUNCAT Linked Data.

We chose the 4Store application, which is fully open source, efficient and scalable, and provides a stable RDF database. In our experience it is also simpler to install than other products. We installed 4Store on an independent host to keep it separate from other services, for security and ease of maintenance.

Loading the data

The data contributed by each library was processed separately. First, the data was extracted from SUNCAT, respecting any restrictions placed by the specific library. It was then transformed into RDF/XML and finally loaded into the triplestore. Each of these steps can be fairly time consuming, depending on the size of the data file. Once the data from each library has been added to the triplestore, queries can be made across the whole RDF database.
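
As a rough sketch of the final loading step, the Python below builds one import command per library file. It assumes 4Store's 4s-import command-line tool invoked as "4s-import <kb-name> <file>". The knowledge-base name "suncat" and the file names are hypothetical, and the commands are only printed, not run.

```python
KB_NAME = "suncat"  # hypothetical knowledge-base name


def import_commands(rdf_files, kb_name=KB_NAME):
    """Build one 4s-import command per RDF/XML file, in the order given."""
    return [["4s-import", kb_name, str(path)] for path in rdf_files]


if __name__ == "__main__":
    # Hypothetical per-library output files from the transformation step.
    files = ["library_a.rdf", "library_b.rdf"]
    for cmd in import_commands(files):
        print(" ".join(cmd))
```

To actually load the files, each command could be passed to subprocess.run(cmd, check=True) so that a failed import stops the batch.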

APIs

An HTTP server is required to provide external access and allow querying of the triplestore. 4Store includes a simple SPARQL HTTP protocol server which answers SPARQL 1.1 queries. Once the server is running, you can query the triplestore using:

  1. A machine-to-machine API at http://sparql1.edina.ac.uk:8181/sparql/.
  2. A basic GUI at http://sparql1.edina.ac.uk:8181/test/.
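
A minimal sketch of calling the machine-to-machine API from Python follows. It uses the standard SPARQL protocol convention of sending the query in a "query" parameter on a GET request. Asking for JSON results through the Accept header is an assumption about how this server is configured, and the function names are ours.

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://sparql1.edina.ac.uk:8181/sparql/"  # API URL given above


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL protocol GET request carrying the query in `query=`."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Assumption: the server honours content negotiation for JSON results.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the request and return the raw response body as text."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Print the request URL without contacting the server.
    print(build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5").full_url)
```

Keeping request construction separate from sending makes it easy to inspect the encoded URL before running anything against the live endpoint.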

GUI

The functionality of the basic test GUI is rather limited and only enables SELECT, CONSTRUCT, ASK and DESCRIBE operations. In order to customise the interface and provide additional information such as example queries, we used SPARQLfront, an open source SPARQL frontend designed by Dave Challis and available on GitHub. SPARQLfront is a PHP and JavaScript based frontend and can be installed on top of a default Apache2/PHP server. It supports SPARQL 1.0.

An improved GUI is available at: http://sparql1.edina.ac.uk:8181/endpoint/.

The DiscoverEDINA SUNCAT SPARQL endpoint GUI provides four sample queries to help the user with the format and syntax required to compose correct SPARQL queries. For example, one of the queries is:

Is the following title (i.e. archaeological reports) held anywhere in the UK? 

SELECT ?title ?holder
WHERE {
        ?j foaf:primaryTopic ?pt .
        ?pt dc:title ?title ;
            lh:held ?h .
        ?h lh:holder ?holder .

        FILTER regex(str(?title), "archaeological reports", "i")
}

The user is provided with a box in which to enter queries. Syntax highlighting is provided to help with composition. The user can also choose whether or not to display the namespaces in the box. A range of output formats can be selected:

  • SPARQL XML (the default)
  • JSON
  • Plain text
  • Serialized PHP
  • Turtle
  • RDF/XML
  • Query structure
  • HTML table
  • Tab Separated Values (TSV)
  • Comma Separated Values (CSV)
  • SQLite database
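
As an illustration of consuming one of these formats, the sketch below flattens a response in the standard SPARQL JSON results layout into one dictionary per result row. The sample response is made up, with placeholder values, and whether ?holder comes back as a literal or a URI in the real data is not something this sketch assumes either way.

```python
import json

# Made-up response in the standard SPARQL JSON results layout.
SAMPLE_RESPONSE = """{
  "head": {"vars": ["title", "holder"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "value": "Example title"},
     "holder": {"type": "literal", "value": "Example Library"}}
  ]}
}"""


def rows(json_text):
    """Return one {variable: value} dict per result binding."""
    doc = json.loads(json_text)
    names = doc["head"]["vars"]
    return [
        # Unbound variables are simply absent from a binding, so skip them.
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in doc["results"]["bindings"]
    ]


if __name__ == "__main__":
    for row in rows(SAMPLE_RESPONSE):
        print(row)
```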

The SPARQL endpoint GUI is ideal for running interactive queries, and for developing or troubleshooting queries that will later be run through the machine-to-machine SPARQL API or used in conjunction with the SRU target.

Supporting Discovery open metadata principles

What we have achieved so far:

SUNCAT received funding from the JISC Discovery Programme Phase 1, from February 1st to July 31st 2011, to explore what might be done to extend access to the metadata from the contributing libraries already aggregated by EDINA for the SUNCAT service. This included:

  • establishing use cases
  • exploring metadata licensing issues
  • determining what metadata to make available
  • mechanisms for providing access to the metadata

Much was achieved during the six-month project to extend access to the catalogue, including holdings information. The diagrams below describe the data in the scope of the project.

There was agreement from three libraries to use some of their data for open access during the short lifetime of the project. These were: the British Library, the National Library of Scotland and the Society of Antiquaries.

While working on representing in RDF the data aggregated by SUNCAT from various libraries across the UK, we found that existing vocabularies for describing bibliographic data generally lack constructs for dealing with holding statements. As SUNCAT primarily contains information about holdings of journals in the contributing libraries, the primary value of this information is clearly in the holding statements. Since we have chosen a relatively flat model of catalogue records, as is natural in the MARC21 source data and appropriate with the Bibliographic Ontology, there is no obvious way to express this information, which might normally go at the Item level were we to use a more elaborate model like FRBR-RDF.

More on how we defined a Library’s holdings can be found here.

What we are going to do:

We are taking some of this work forward in this project to further enhance the SUNCAT service:

  • Continue to increase the number of Contributing Libraries involved in the SUNCAT open metadata initiative
  • Implement a filtering mechanism to cater for different data being included in a particular format
  • Where the Contributing Libraries give agreement for release, implement an ‘on the fly’ filtering mechanism for their data
  • Explore provision of other record formats to support use within other applications (e.g. MODS, a simple DC) where use cases were identified
  • Explore further the status (regarding copyright and database rights) of RDF triples that have been derived from data that were part of a database provided by a third party