Updated data in SPARQL and new SRU target for SUNCAT open data

If you’re interested in SUNCAT data then you’ll know that there’s been a lot of activity with the new SUNCAT service interface (http://suncat.ac.uk/).

There’s also been a lot of activity with the underlying database and some improvements to the data sitting behind that too.

With new update processes in place, the Discover EDINA open data has benefitted, and a new, up-to-date set of open data is now available on the SRU target, with a shiny new permanent base URL too!

The DiscoverEDINA SUNCAT Open Data SRU target is now based at

http://m2m.edina.ac.uk/sru/de_suncat

All the same SRU and CQL queries should run as before, with the updated data, and the addition of multiple records for the same SUNCAT ID (these represent the same journal but from multiple libraries).

Hope you find it useful!

Also see:
SPARQL endpoint for SUNCAT
SUNCAT Open Data – SRU

SPARQL endpoint for SUNCAT

As we explored how to extend access to the metadata contributed by a set of libraries using the SUNCAT service, in order to promote discovery and reuse of the data, it soon became clear that Linked Data was one of the preferred formats to enable this.

The previous phase of this project developed a transformation to express the information on holdings in an RDF model. The XSLT produced converts MARC-XML into RDF/XML. This XSLT transformation was used to process over 1,000,000 holdings records made available by the British Library, the National Library of Scotland, the University of Bristol Library, the University of Nottingham Library, the University of Glasgow Library and the library of the Society of Antiquaries of London in order to make them available through a Linked Data SPARQL endpoint interface.

Setting up the Triplestore

We built on previous experience at EDINA of providing SPARQL endpoints to set up the interface for the SUNCAT Linked Data.

We chose the 4Store application which is fully open source, efficient, scalable, and provides a stable RDF database. Our experience is that it is also simpler to install than other products. We installed 4Store on an independent host in order to keep this application separate from other services for security and easy maintenance.

Loading the data

The data contributed by each library was processed separately. First, the data was extracted from SUNCAT, respecting any restrictions placed by the specific library. It was then transformed into RDF/XML and finally loaded into the triplestore. Each of these steps can be fairly time-consuming, depending on the size of the data file. Once the data from each library has been added to the triplestore, queries can be made across the whole RDF database.

APIs

An HTTP server is required to provide external access and allow querying of the triplestore. 4Store includes a simple SPARQL HTTP protocol server which answers SPARQL 1.1 queries. Once the server is running, you can query the triplestore using:

  1. A machine-to-machine API at http://sparql1.edina.ac.uk:8181/sparql/.
  2. A basic GUI at http://sparql1.edina.ac.uk:8181/test/.
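As a rough illustration of the machine-to-machine route, here is a minimal Python sketch of a SPARQL protocol request. It assumes the endpoint accepts the standard `query` parameter on a GET request and honours an `Accept` header asking for SPARQL JSON results; the function names are made up for this example and the endpoint may no longer be reachable.

```python
import urllib.parse
import urllib.request

# Endpoint URL taken from the post; availability is not guaranteed.
ENDPOINT = "http://sparql1.edina.ac.uk:8181/sparql/"


def build_sparql_url(endpoint, query):
    """Return a GET URL carrying the query in the standard `query` parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_query(endpoint, query, timeout=30):
    """Send the query and return the raw response body (needs a live endpoint)."""
    request = urllib.request.Request(
        build_sparql_url(endpoint, query),
        # Assumption: the server honours content negotiation for JSON results.
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `run_query(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")` would then return the response body as text, provided the server is up.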

GUI

The functionality of the basic test GUI is rather limited and only enables SELECT, CONSTRUCT, ASK and DESCRIBE operations. In order to customise the interface and provide additional information like example queries, we used an open source SPARQL frontend designed by Dave Challis called SPARQLfront, which is available on GitHub. SPARQLfront is a PHP and JavaScript based frontend and can be installed on top of a default Apache2/PHP server. It supports SPARQL 1.0.

An improved GUI is available at: http://sparql1.edina.ac.uk:8181/endpoint/.

The DiscoverEDINA SUNCAT SPARQL endpoint GUI provides four sample queries to help the user with the format and syntax required to compose correct SPARQL queries. For example, one of the queries is:

Is the following title (i.e. archaeological reports) held anywhere in the UK? 

SELECT ?title ?holder
WHERE {
        ?j foaf:primaryTopic ?pt.
        ?pt dc:title ?title;
            lh:held ?h.
        ?h lh:holder ?holder.

        FILTER regex(str(?title), "archaeological reports", "i")
      }

The user is provided with a box in which to enter queries. Syntax highlighting is provided to help with composition. The user can also select whether to display the namespaces in the box or not. There is a range of output formats that can be selected:

  • SPARQL XML (the default)
  • JSON
  • Plain text
  • Serialized PHP
  • Turtle
  • RDF/XML
  • Query structure
  • HTML table
  • Tab Separated Values (TSV)
  • Comma Separated Values (CSV)
  • SQLite database

The SPARQL endpoint GUI is ideal for running interactive queries, developing or troubleshooting queries to be run by the m2m SPARQL API or used in conjunction with the SRU target.
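Once a query works in the GUI, the JSON output is probably the easiest to consume from a script. The sketch below flattens a document in the standard SPARQL JSON results layout into plain Python dictionaries. The sample data is invented to match the shape of the `?title ?holder` query above and is not real SUNCAT output.

```python
import json


def rows_from_sparql_json(text):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are absent from a binding, so default to None.
        rows.append({name: binding.get(name, {}).get("value") for name in variables})
    return rows


# Invented sample in the standard results layout; values are placeholders.
SAMPLE = """
{
  "head": {"vars": ["title", "holder"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "value": "Example archaeological reports"},
     "holder": {"type": "uri", "value": "http://example.org/library/1"}}
  ]}
}
"""

for row in rows_from_sparql_json(SAMPLE):
    print(row["title"], "|", row["holder"])
```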

Making records from the SUNCAT database openly available: the experience with licensing

The background is explained in an earlier post (July 10 2012). SUNCAT (Serials UNion CATalogue) aggregates the metadata (bibliographical and holdings information) for serials, regardless of physical format, held in (currently) 89 libraries. It was planned (with the agreement of the Contributing Libraries) to make as much of this data openly available as possible.

It was decided to adopt an opt-in policy. This approach was taken since it was felt that the Contributing Libraries (CLs) needed to be fully aware of the commitment they were making and to have the opportunity to place any particular restrictions, such as limiting the data which could be made open or restricting the number of formats in which the data would be made available. In the event, most of the participants availed themselves of the opportunity to specify, unambiguously, which data they were agreeing to being made open.

Legal advice was taken from the University solicitors and the licence format adopted was Open Data Commons Public Domain Dedication and Licence with reference to the ODC Attribution Share Alike Community Norms. Staff in quite a number of institutions expressed interest but, in the event, only staff in 6 institutions proceeded as far as signing a licence with EDINA. A copy of the standard agreement may be viewed here.

Since many libraries have acquired some of the metadata records they use in OPACs from one or more third party commercial suppliers, there were very understandable concerns about giving permission for EDINA to make records from these sources openly available.  Accordingly, it was necessary to add an Appendix to the individual Agreements, specifying what particular restrictions should be applied.

The situation applying to each of the libraries is as follows:

British Library: Permission was given to publish all serials records, but they are not to be made available in MARCXML or MARC21 formats.

National Library of Scotland: Permission was given to publish as open data any NLS record that has ‘StEdNL’ in the MARC field 040 $a, and to publish as open data the title, ISSN and holdings data for any serials record in their catalogue.

University of Bristol Library: Permission was given to publish as open data any Bristol record that has “UkBrU-I” in the MARC field 040 $a, e.g.,

040   L $$aUkBrU-I

University of Nottingham Library: Permission was given to publish as open data any Nottingham record that has “UkNtU” in the MARC field 040 $a and $c.

e.g., 040   L $$aUkNtU$$cUkNtU

However, if there is an 035 tag identifying a different library, then do not use this record.

e.g.,

035   L $$a(OCoLC)1754614
035   L $$a(SFX)954925250111
035   L $$a(CONSER)sc-84001881-
040   L $$aUkNtU$$cUkNtU

University of Glasgow Library: Permission was given to publish as open data any Glasgow record that is not derived from Serials Solutions, as indicated in the MARC field 035 $$a (WaSeSS).

Library of the Society of Antiquaries of London: Permission was given to publish as open data all serials records.

As mentioned above, staff in quite a few other libraries expressed interest in becoming involved but the short timescale of the project meant that there had to be concentration on those libraries able to sign the licence agreement quite quickly.

Subject to the availability of further funding it is planned to continue discussions with those libraries which have expressed interest but were not able to proceed to signing an agreement.

Negotiating the specific requirements for each of the libraries was a time-consuming, although necessary, process, and there are concerns about the resources which would be required to carry out the negotiation for a rather larger number of libraries than participated in this phase.

Taken together the records which can be made openly available total in excess of 1,000,000; a considerable quantity of serials’ metadata.  Once the data has been released it will be most interesting to monitor the usages made of it.

Details about making the data openly available and the ways in which developers and others can access it are outlined in a separate blog entry.

That library staff have concerns about making available metadata which has been obtained from one or other third party has been well recognised for some time but to date there has been very little progress on resolving these issues at either a national or international level.  In the earlier blog post it was stated that:

“A number of librarians said that it would be a good idea if JISC/EDINA could come to an agreement with organisations such as OCLC and RLUK rather than individual libraries needing to approach them; this is an idea certainly worth pursuing”.

JISC did commission work to be carried out in this area and there is a website available which provides guidance. Whilst, clearly, this is very helpful, the onus is placed upon staff in individual libraries to look carefully at their licence agreements with third party suppliers: even where this is done, what is often found is that the licence agreements are not necessarily clear and unambiguous on what is possible and what is not.

RLUK recently commissioned work to scope the parameters of making RLUK data openly available and the results of that work should make helpful reading even if the focus is just on material in the RLUK database.

It certainly would be of considerable benefit to the HE community as a whole if national bodies including the JISC, SCONUL and RLUK could accept responsibility for initiating discussions with third party suppliers of records with a view to negotiating removing all restrictions on making metadata openly available.  Such an approach would remove the need for individual libraries to investigate their specific local circumstances and would be of enormous potential benefit to the user community.

SUNCAT Open Data – SRU

As part of the /open/ data strand of the SUNCAT bit of Discover EDINA, we have made available the individual library records that we have agreement to release. At the time of writing these are:

National Library of Scotland, Glasgow University, British Library, Bristol University, Society of Antiquaries of London, Nottingham University.

In order to make these records available, we’ve opted for an SRU target, which is RESTful. In the first instance we intend users to run searches through the SPARQL interface (see other post) and use the linked part of the data in the RDF incarnation of the records, and then use the SUNCAT ID to link through to the SRU target to extract the full MARC record (in most cases) should that be needed.

Since the target is a full-blown SRU server there is actually a plethora of indexes built over the MARC-XML records, but the one we anticipate being used most is that for the SUNCAT ID. However, users are welcome to use the other indexes, which are detailed below.

In the first instance, the DiscoverEDINA SUNCAT SRU target can be found at

http://suncatdev.edina.ac.uk:31001/de_suncat

[EDIT 2014-05-13. The above URL should work but it is now preferred to use

http://m2m.edina.ac.uk/sru/de_suncat ]

so in order to get the MARC-XML format of a record with SUNCAT ID of “SC00374927310” you should send a CQL query of sc.id=SC00374927310 which goes into an SRU searchRetrieve request as:

http://suncatdev.edina.ac.uk:31001/de_suncat?operation=searchRetrieve&version=1.1&startRecord=1&maximumRecords=1&query=sc.id%3DSC00374927310
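For anyone scripting these requests, a small Python sketch of how such a URL might be assembled is shown here. The function name and its defaults are made up for this example; only the parameter names and values come from the request above.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql, start_record=1, maximum_records=1):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = [
        ("operation", "searchRetrieve"),
        ("version", "1.1"),
        ("startRecord", str(start_record)),
        ("maximumRecords", str(maximum_records)),
        ("query", cql),  # urlencode percent-encodes '=' as %3D
    ]
    return base_url + "?" + urlencode(params)


print(sru_search_url("http://m2m.edina.ac.uk/sru/de_suncat", "sc.id=SC00374927310"))
```

Letting `urlencode` handle the escaping avoids having to hand-encode the `=` inside the CQL query.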

Remember that the number of records released under the Open Data umbrella is limited, so you won’t find every SUNCAT ID here, but you will find every one that’s in the SPARQL endpoint.

The response will be a bunch of XML that is an SRU Response, and it may contain records (about the same item) from multiple libraries. These records can be found at the XPath zs:searchRetrieveResponse/zs:records/zs:record/zs:recordData. The number of records found is always sent in the zs:searchRetrieveResponse/zs:numberOfRecords element and you can specify which and how many records to retrieve by varying the startRecord and maximumRecords parameters in the HTTP query string.
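A minimal Python sketch of pulling those two pieces out of a response follows. It assumes the `zs` prefix is bound to the usual SRU 1.1 namespace, `http://www.loc.gov/zing/srw/`; the sample response is a hand-written placeholder, not output from the real server.

```python
import xml.etree.ElementTree as ET

# Namespace conventionally bound to the zs prefix in SRU 1.1 responses.
NS = {"zs": "http://www.loc.gov/zing/srw/"}


def parse_sru_response(xml_text):
    """Return (number_of_records, list of zs:recordData elements)."""
    root = ET.fromstring(xml_text)  # root is zs:searchRetrieveResponse
    count = int(root.findtext("zs:numberOfRecords", default="0", namespaces=NS))
    records = root.findall("zs:records/zs:record/zs:recordData", NS)
    return count, records


# Hand-written placeholder response; real recordData would hold MARC-XML or RDF.
SAMPLE = """<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:version>1.1</zs:version>
  <zs:numberOfRecords>2</zs:numberOfRecords>
  <zs:records>
    <zs:record><zs:recordData><placeholder>first</placeholder></zs:recordData></zs:record>
    <zs:record><zs:recordData><placeholder>second</placeholder></zs:recordData></zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

count, records = parse_sru_response(SAMPLE)
print(count, len(records))
```

Note that the number of records found can be larger than the number of records returned in a single response, which is why the two values are reported separately.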

By default, records will be returned in MARC-XML, with the exception of British Library records, which (due to licensing issues) will always be returned in the RDF transformed version of the record.

Okay, so that’s the basics of grabbing a full MARC-XML record with a SUNCAT ID.  Now for the fun stuff (I’m using ‘fun’ in quite a broad sense of the word).

You can grab a (non-BL) record in five (yes, five) different XML schemata!  To do so, just append the parameter recordSchema=X where X is one of marc (also the default), rdf, mods, mads, dc.  This transforms the MARC-XML into one of the other formats using an XSLT transform.  The rdf one was created in our previous project, and the mods, mads and dc ones are from Indexdata’s zebra software (freely available from http://www.indexdata.com/zebra).  These are relatively simple but might be useful.
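To illustrate, this Python sketch adds the recordSchema parameter to a lookup by SUNCAT ID and rejects schema names that are not in the list above. The function name is hypothetical and the check is purely client-side; how the server reacts to an unknown schema is not covered here.

```python
from urllib.parse import urlencode

# Schema names listed in the post; marc is the server default.
RECORD_SCHEMAS = ("marc", "rdf", "mods", "mads", "dc")


def sru_record_url(base_url, suncat_id, record_schema="marc"):
    """Build a searchRetrieve URL for one SUNCAT ID in the chosen schema."""
    if record_schema not in RECORD_SCHEMAS:
        raise ValueError("unknown recordSchema: " + record_schema)
    params = [
        ("operation", "searchRetrieve"),
        ("version", "1.1"),
        ("query", "sc.id=" + suncat_id),
        ("recordSchema", record_schema),
    ]
    return base_url + "?" + urlencode(params)


print(sru_record_url("http://m2m.edina.ac.uk/sru/de_suncat", "SC00374927310", "mods"))
```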

Even more fun: obviously we’re making the records search-and-retriev-able on the SUNCAT ID since the perceived workflow is to use SPARQL to query the SPARQL endpoint, obtain the links in the RDF records (including a SUNCAT ID), use that SUNCAT ID to obtain the full records of anything you’re interested in from the SRU server.  However, since this is a full-blown SRU server, we’ve actually got a full set of indexes, and you can use any valid CQL query combining the lot of them!

These indexes are designed to be as close as possible to the existing SUNCAT service Z39.50 target indexes. In the SRU server some are prefixed with the “bib1” namespace and the rest with the “sc” namespace. Here is a table of the bib1 indexes and their equivalent Z39.50 BIB-1 index:

bib1.date/time-last-modified = Date/time-last-modified
bib1.lc-card-number = LC-card-number
bib1.isbn = ISBN
bib1.number-music-publisher = Number-music-publisher
bib1.name = Name
bib1.author = Author
bib1.author-name-personal = Author-name-personal
bib1.dewey-classification = Dewey-classification
bib1.issn = ISSN
bib1.lc-call-number = LC-call-number
bib1.nlm-call-number = NLM-call-number
bib1.place-publication = Place-publication
bib1.publisher = Publisher
bib1.title-series = Title-series
bib1.identifier-standard = Identifier-standard
bib1.subject-heading = Subject-heading
bib1.number-govt-pub = Number-govt-pub
bib1.title = Title
bib1.any = Any
bib1.server-choice = Server-choice
bib1.date = Date
bib1.date-of-publication = Date-of-publication
bib1.title-uniform = Title-uniform
bib1.code-institution = Code-institution
bib1.note = Note
bib1.code-language = Code-language
bib1.code-geographic = Code-geographic

The sc indexes are mapped to their equivalent SUNCAT service index (the full list follows below). They are not well documented here and some duplicate the bib1 indexes, but you’re free to play! Almost certainly the two most useful are the SUNCAT ID index, SC_ID, and the contributing library code index, SC_WIS. The values for SC_WIS can be:

StEdNL (National Library of Scotland)
StGlU (Glasgow University)
Uk (British Library)
UkBrU-I (Bristol University)
UkLSAL (Society of Antiquaries of London)
UkNtU (Nottingham University)

Here is the full list of sc indexes:

sc.id = SC_ID
sc.005 = SC_005
sc.010 = SC_010
sc.020 = SC_020
sc.022 = SC_022
sc.028 = SC_028
sc.035 = SC_035
sc.049 = SC_049
sc.aut = SC_AUT
sc.awt = SC_AWT
sc.ddc = SC_DDC
sc.gvd = SC_GVD
sc.ismn = SC_ISMN
sc.issn = SC_ISSN
sc.lcc = SC_LCC
sc.nlm = SC_NLM
sc.pla = SC_PLA
sc.pub = SC_PUB
sc.sbd = SC_SBD
sc.sgn = SC_SGN
sc.sici = SC_SICI
sc.sid = SC_SID
sc.srs = SC_SRS
sc.ssn = SC_SSN
sc.stidn = SC_STIDN
sc.stmd = SC_STMD
sc.sub = SC_SUB
sc.sud = SC_SUD
sc.sul = SC_SUL
sc.sum = SC_SUM
sc.tit = SC_TIT
sc.ttl = SC_TTL
sc.wrd = SC_WRD
sc.wyr = SC_WYR
sc.wti = SC_WTI
sc.wau = SC_WAU
sc.wut = SC_WUT
sc.wur = SC_WUR
sc.wnc = SC_WNC
sc.wfm = SC_WFM
sc.wtp = SC_WTP
sc.wgo = SC_WGO
sc.wct = SC_WCT
sc.wid = SC_WID
sc.wsd = SC_WSD
sc.ntl = SC_NTL
sc.wis = SC_WIS
sc.wst = SC_WST
sc.wuc = SC_WUC
sc.wucx = SC_WUCX
sc.wuco = SC_WUCO
sc.wno = SC_WNO
sc.wln = SC_WLN
sc.wpu = SC_WPU
sc.wpl = SC_WPL
sc.wsrs1 = SC_WSRS1
sc.wsrs2 = SC_WSRS2
sc.wga = SC_WGA
sc.wsu = SC_WSU
sc.wsm = SC_WSM
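As a sketch of how the indexes listed above might be combined, the Python below joins index/value pairs into a single CQL query using the boolean `and`, quoting each term so that values containing spaces stay intact. The helper names are made up for this example, and whether a particular combination returns anything useful on this server is an assumption to be tested rather than a documented behaviour.

```python
def cql_term(value):
    """Quote a CQL search term, escaping backslashes and embedded double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def cql_and(clauses):
    """Join (index, value) pairs into one CQL query using the boolean `and`."""
    return " and ".join(index + "=" + cql_term(value) for index, value in clauses)


# Title words restricted to one contributing library (code taken from the SC_WIS list).
query = cql_and([("bib1.title", "archaeological reports"), ("sc.wis", "UkLSAL")])
print(query)
```

The resulting string still needs URL-encoding before it goes into the `query` parameter of a searchRetrieve request.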

SUNCAT open data

First problem: getting permission from contributing libraries to allow their data to be re-distributed.  Fortunately for me that’s not my problem, and some sterling work from other members of the team has allowed some data to be released without strings.

Libraries who allow some of their data out into the wild usually have a stipulation that it can be any record they’ve contributed that doesn’t originate from such-and-such source, or has been created by them, or similar.

In practice, this means using records from particular libraries that have a particular library code in 040$a or don’t have a particular code in 035$a.  These types of rules could be added automatically at a live filtering stage, but in order to be utterly sure nothing untoward is being released we have chosen to extract those data and build a separate database from those alone.
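The shape of such a rule can be sketched in a few lines of Python. This is an illustration only, not the extraction code actually used: the record representation (a list of tag and subfield-dictionary pairs), the function names and the sample values are all invented for the example, while the codes UkBrU-I and (WaSeSS) come from the library agreements.

```python
def subfield_values(record, tag, code):
    """Collect every value of subfield `code` in fields with the given tag."""
    return [subs[code] for field_tag, subs in record if field_tag == tag and code in subs]


def is_releasable(record, required_040a=None, forbidden_035a_prefix=None):
    """Apply a per-library rule: require a code in 040$a and/or forbid one in 035$a."""
    if required_040a is not None:
        if required_040a not in subfield_values(record, "040", "a"):
            return False
    if forbidden_035a_prefix is not None:
        for value in subfield_values(record, "035", "a"):
            if value.startswith(forbidden_035a_prefix):
                return False
    return True


# Invented records in a simplified (tag, subfields) form; not real catalogue data.
bristol_record = [("040", {"a": "UkBrU-I"})]
glasgow_record = [("035", {"a": "(WaSeSS)example-id"}), ("040", {"a": "StGlU"})]

print(is_releasable(bristol_record, required_040a="UkBrU-I"))
print(is_releasable(glasgow_record, forbidden_035a_prefix="(WaSeSS)"))
```

Running a check like this once at extraction time, rather than on every request, matches the cautious approach of building a separate database from the permitted records alone.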

So, once you get past the problem of libraries allowing their data to be distributed freely (which we haven’t 😉 ) you then need to allow clients to usefully connect and retrieve the data.  Two approaches are being taken for this.

The first is to produce an SRU target onto the database of (permitted) records. We have a lot of experience with IndexData’s open source Zebra product, which is a database and Z39.50/SRU frontend all in one. It can be quite fiddly to configure (which is where the experience comes in handy!) but its performance (speed and reliability) is excellent. It also allows multiple output formats for the records using XSLT.

One of the most useful outcomes from the Linked Data Focus project was an XSLT produced by Will Waites that converts MARC-XML into RDF/XML.  We can use this as one of the outputs from the SRU target, alongside MARC-XML (although some libraries have a requirement that their records not be released in MARC-XML, in which case the XSLT just blanks these records when requested in MARC-XML), a rudimentary MODS transformation, and a JSON transformation might be a possibility too.

Perhaps more usefully for the RDF/XML data, the second approach is to feed these into a SPARQL endpoint.  This should allow anyone interested in the linked data RDF to query in a language more familiar to the linked data world.

We’ll be providing more information on how to connect to the SRU target and the SPARQL endpoint once we’ve polished them up a bit for you.

Licensing SUNCAT serials’ records

The reasons for making bibliographic metadata openly available have been well put by JISC in the Open Bibliographic Data Guide and by the Open Knowledge Foundation. Whilst many librarians are keen to support making their institutional library metadata available, there are issues to be resolved. There can be copyright issues and contractual issues over records in library OPACs which inhibit the release of records. The records in many OPACs will have been obtained from one or more third party organisations (e.g. OCLC, British Library, Ex Libris, Serials Solutions) and even though the records received from these third parties will often have been modified, perhaps quite extensively, there are understandable concerns expressed about the possible repercussions of making them available under an open licence.

SUNCAT is an aggregation of serials’ metadata from (currently) 86 libraries (referred to as Contributing Libraries (CLs)). Whilst much of the metadata will have been created by local library staff and will, therefore, be ‘owned’ by the library, some of it will have been purchased from a third party supplier. The metadata is essentially supplied to EDINA on the basis of goodwill and a common understanding about how the data is used and made available. EDINA reached agreement with third party record suppliers that records in MARC21 format could be made available for downloading, but only to staff in CLs.

In the initial project SUNCAT: exploring open metadata (funded under the JISC Capital funded RDTF participation) the decision was taken to adopt an ‘opt in’ approach and, accordingly, an invitation was sent to all the CLs inviting them to participate in making their SUNCAT contributed data openly available under an Open Data Commons Public Domain Dedication and Licence with reference to the ODC Attribution Share Alike Community Norms. Considerable interest was expressed by CLs in becoming involved but concerns, particularly to do with making third party records available, were raised. A number of librarians said that it would be a good idea if JISC/EDINA could come to an agreement with organisations such as OCLC and RLUK rather than individual libraries needing to approach them; this is an idea certainly worth pursuing.

Licences have now been signed by three organisations. They are the British Library (BL), the National Library of Scotland (NLS) and the Society of Antiquaries; discussions are well advanced with a number of additional organisations. After discussion with BL staff, it was agreed that it would be preferable to add an Appendix to an existing contract between EDINA and the BL, and this has been done. All the data supplied to EDINA by the BL can be made openly available, provided records are not made available in either MARC21 or MARCXML formats. In the case of the National Library of Scotland permission has been given to make all the fields available of all records which have been created by NLS (identified by the presence of ‘StEdNL’ in the 040$a field) or to make title, ISSN and holdings information available for the whole of the contribution to SUNCAT. The Society of Antiquaries has placed no restrictions on the use of their contributed records.

Glasgow University has asked for records from a third party supplier to be excluded from the records made available for open usage and this will be done.

Work is now being carried out to make the records from the initial three organisations freely available on the basis described in the licences and as other licences are signed by additional organisations more data will be published for open usage.

Supporting Discovery open metadata principles

What we have achieved so far:

SUNCAT received funding from the JISC Discovery Programme Phase 1, from February 1st to July 31st 2011,  to explore what might be done to extend access to the metadata from the contributing libraries already aggregated by EDINA for the SUNCAT service. This included:

  • establishing use cases
  • exploring metadata licensing issues
  • determining what metadata to make available
  • mechanisms for providing access to the metadata

Much was achieved during the 6 month project to extend access to the catalogue, including holdings’ information. The diagrams below describe the data in the scope of the project.

There was agreement from three libraries to use some of their data for open access during the short lifetime of the project. These were: the British Library, the National Library of Scotland and the Society of Antiquaries.

In the process of working on representing data aggregated by SUNCAT from various libraries across the UK in RDF, we found that existing vocabularies for describing bibliographic data are generally missing constructs for dealing with holding statements. As SUNCAT primarily contains information relating to holdings of journals in the contributing libraries, the primary value of this information is clearly in the holding statements. Since we have chosen a relatively flat model of catalogue records, as is natural in the MARC21 source data and is appropriate with the Bibliographic Ontology, there is no obvious way to express this information, which might normally go at the Item level were we to use a more elaborate model like FRBR-RDF.

More on how we defined a Library’s holdings can be found here.

What we are going to do:

We are taking forward some of these in this project to further enhance the SUNCAT service:

  • Continue to increase the number of Contributing Libraries involved in the SUNCAT open metadata initiative
  • Implement a filtering mechanism to cater for different data being included in a particular format
  • Where the Contributing Libraries give agreement for release, implement an ‘on the fly’ filtering mechanism for their data
  • Explore provision of other record formats to support use within other applications (e.g. MODS, a simple DC) where use cases were identified
  • Explore further the status of RDF triples (regarding Copyright and database rights) that have derived from data that were part of a database provided by a third party.