Updated data in SPARQL and new SRU target for SUNCAT open data

If you’re interested in SUNCAT data then you’ll know that there’s been a lot of activity with the new SUNCAT service interface (http://suncat.ac.uk/).

There’s also been a lot of activity with the underlying database and some improvements to the data sitting behind that too.

With new update processes in place, the Discover EDINA open data has benefited too: a fresh, up-to-date set of open data is now available on the SRU target, with a shiny new permanent base URL!

The DiscoverEDINA SUNCAT Open Data SRU target is now based at

http://m2m.edina.ac.uk/sru/de_suncat

All the same SRU and CQL queries should run as before, now against the updated data, with one addition: a single SUNCAT ID may return multiple records (the same journal as held by different libraries).

Hope you find it useful!

Also see:
SPARQL endpoint for SUNCAT
SUNCAT Open Data – SRU

SUNCAT Open Data – SRU

As part of the /open/ data strand of the SUNCAT bit of Discover EDINA, we have made available the individual library records that we have agreement to release.  At the time of writing these are:

National Library of Scotland, Glasgow University, British Library, Bristol University, Society of Antiquaries of London, Nottingham University.

In order to make these records available, we’ve opted for an SRU target, which is RESTful.  In the first instance we intend users to run searches via the SPARQL interface (see other post), follow the linked part of the data in the RDF incarnation of the records, and then use the SUNCAT ID to link through to the SRU target to extract the full MARC record (in most cases) should that be needed.

Since the target is a full-blown SRU server there is actually a plethora of indexes built over the MARC-XML records, but the one we anticipate being used most is the SUNCAT ID index.  However, users are welcome to use the other indexes, which are detailed below.

In the first instance, the DiscoverEDINA SUNCAT SRU target can be found at

http://suncatdev.edina.ac.uk:31001/de_suncat

[EDIT 2014-05-13. The above URL should work but it is now preferred to use

http://m2m.edina.ac.uk/sru/de_suncat ]

so in order to get the MARC-XML format of a record with SUNCAT ID of “SC00374927310” you should send a CQL query of sc.id=SC00374927310 which goes into an SRU searchRetrieve request as:

http://suncatdev.edina.ac.uk:31001/de_suncat?operation=searchRetrieve&version=1.1&startRecord=1&maximumRecords=1&query=sc.id%3DSC00374927310
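As a rough sketch, the request above can be assembled programmatically. This assumes nothing beyond the parameters already shown; it uses the m2m.edina.ac.uk base URL from the 2014 edit note, but the suncatdev URL works the same way:

```python
from urllib.parse import urlencode

# Base URL of the DiscoverEDINA SUNCAT SRU target (the newer m2m
# address; the older suncatdev address behaves identically).
BASE_URL = "http://m2m.edina.ac.uk/sru/de_suncat"

def sru_search_url(cql_query, start=1, maximum=1):
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "startRecord": start,
        "maximumRecords": maximum,
        "query": cql_query,   # urlencode escapes the '=' inside the CQL
    }
    return BASE_URL + "?" + urlencode(params)

url = sru_search_url("sc.id=SC00374927310")
print(url)
```

Fetching that URL with any HTTP client then returns the SRU response described below.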

Remember that the number of records released under the Open Data umbrella is limited, so you won’t find every SUNCAT ID here, but you will find every one that’s in the SPARQL endpoint.

The response will be a bunch of XML forming an SRU response, and it may contain records (about the same item) from multiple libraries. These records can be found at the XPath zs:searchRetrieveResponse/zs:records/zs:record/zs:recordData. The number of records found is always sent in the zs:searchRetrieveResponse/zs:numberOfRecords element, and you can specify which and how many records to retrieve by varying the startRecord and maximumRecords parameters in the HTTP query string.
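A minimal sketch of pulling those elements out of a response follows. The XML here is an illustrative stub, not real SUNCAT data; the zs prefix is the standard SRU response namespace (http://www.loc.gov/zing/srw/):

```python
import xml.etree.ElementTree as ET

ZS = "http://www.loc.gov/zing/srw/"  # standard SRU response namespace

# A trimmed, illustrative searchRetrieveResponse (not real SUNCAT data).
sample = """<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:numberOfRecords>2</zs:numberOfRecords>
  <zs:records>
    <zs:record>
      <zs:recordData><record>library copy 1</record></zs:recordData>
    </zs:record>
    <zs:record>
      <zs:recordData><record>library copy 2</record></zs:recordData>
    </zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

root = ET.fromstring(sample)

# Total hits for the query (may exceed the records actually returned).
total = int(root.findtext(f"{{{ZS}}}numberOfRecords"))

# Each zs:recordData wraps one record, possibly from a different library.
records = root.findall(f"{{{ZS}}}records/{{{ZS}}}record/{{{ZS}}}recordData")

print(total, len(records))
```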

By default, records will be returned in MARC-XML, with the exception of British Library records, which (due to licensing issues) will always be returned in the RDF transformed version of the record.

Okay, so that’s the basics of grabbing a full MARC-XML record with a SUNCAT ID.  Now for the fun stuff (I’m using ‘fun’ in quite a broad sense of the word).

You can grab a (non-BL) record in five (yes, five) different XML schemata!  To do so, just append the parameter recordSchema=X, where X is one of marc (the default), rdf, mods, mads or dc.  This transforms the MARC-XML into the other format using an XSLT transform.  The rdf one was created in our previous project, and the mods, mads and dc ones come from Index Data’s Zebra software (freely available from http://www.indexdata.com/zebra).  These are relatively simple but might be useful.
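For instance, a sketch of requesting the RDF rendition rather than the default MARC-XML (same base URL and SUNCAT ID as above):

```python
from urllib.parse import urlencode

BASE_URL = "http://m2m.edina.ac.uk/sru/de_suncat"

# Ask for the RDF rendition of a record instead of the default MARC-XML.
params = {
    "operation": "searchRetrieve",
    "version": "1.1",
    "maximumRecords": "1",
    "recordSchema": "rdf",          # one of: marc, rdf, mods, mads, dc
    "query": "sc.id=SC00374927310",
}
url = BASE_URL + "?" + urlencode(params)
print(url)
```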

Even more fun: obviously we’re making the records search-and-retrievable on the SUNCAT ID, since the perceived workflow is to query the SPARQL endpoint, obtain the links in the RDF records (including a SUNCAT ID), and then use that SUNCAT ID to obtain the full records of anything you’re interested in from the SRU server.  However, since this is a full-blown SRU server, we’ve actually got a full set of indexes, and you can use any valid CQL query combining the lot of them!
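As an illustrative sketch of combining indexes (the title term here is arbitrary, and StEdNL is the National Library of Scotland contributor code listed further down; exact CQL relation support depends on the server configuration):

```python
from urllib.parse import quote

# A CQL query combining two indexes: records contributed by the National
# Library of Scotland (SC_WIS code StEdNL) whose title contains "nature".
cql = "sc.wis=StEdNL and bib1.title=nature"

# The CQL goes into the query parameter of a searchRetrieve request,
# URL-escaped so its '=' and spaces survive the query string.
url = ("http://m2m.edina.ac.uk/sru/de_suncat"
       "?operation=searchRetrieve&version=1.1"
       "&startRecord=1&maximumRecords=10"
       "&query=" + quote(cql))
print(url)
```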

These indexes are designed to be as close as possible to the existing SUNCAT service Z39.50 target indexes.  In the SRU server some are prefixed with the “bib1” namespace and the rest with the “sc” namespace.  Here is a table of the bib1 indexes and their equivalent Z39.50 BIB-1 indexes:

bib1.date/time-last-modified = Date/time-last-modified
bib1.lc-card-number = LC-card-number
bib1.isbn = ISBN
bib1.number-music-publisher = Number-music-publisher
bib1.name = Name
bib1.author = Author
bib1.author-name-personal = Author-name-personal
bib1.dewey-classification = Dewey-classification
bib1.issn = ISSN
bib1.lc-call-number = LC-call-number
bib1.nlm-call-number = NLM-call-number
bib1.place-publication = Place-publication
bib1.publisher = Publisher
bib1.title-series = Title-series
bib1.identifier-standard = Identifier-standard
bib1.subject-heading = Subject-heading
bib1.number-govt-pub = Number-govt-pub
bib1.title = Title
bib1.any = Any
bib1.server-choice = Server-choice
bib1.date = Date
bib1.date-of-publication = Date-of-publication
bib1.title-uniform = Title-uniform
bib1.code-institution = Code-institution
bib1.note = Note
bib1.code-language = Code-language
bib1.code-geographic = Code-geographic

These are the sc indexes mapped to their equivalent SUNCAT service index.  They are not well documented here, and some duplicate the bib1 indexes, but you’re free to play!  Almost certainly the two most useful are the SUNCAT ID index, SC_ID, and the contributing library code index, SC_WIS.  The values for SC_WIS can be:

StEdNL (National Library of Scotland)
StGlU (Glasgow University)
Uk (British Library)
UkBrU-I (Bristol University)
UkLSAL (Society of Antiquaries of London)
UkNtU (Nottingham University)

Here are all the other sc indexes:

sc.id = SC_ID
sc.005 = SC_005
sc.010 = SC_010
sc.020 = SC_020
sc.022 = SC_022
sc.028 = SC_028
sc.035 = SC_035
sc.049 = SC_049
sc.aut = SC_AUT
sc.awt = SC_AWT
sc.ddc = SC_DDC
sc.gvd = SC_GVD
sc.ismn = SC_ISMN
sc.issn = SC_ISSN
sc.lcc = SC_LCC
sc.nlm = SC_NLM
sc.pla = SC_PLA
sc.pub = SC_PUB
sc.sbd = SC_SBD
sc.sgn = SC_SGN
sc.sici = SC_SICI
sc.sid = SC_SID
sc.srs = SC_SRS
sc.ssn = SC_SSN
sc.stidn = SC_STIDN
sc.stmd = SC_STMD
sc.sub = SC_SUB
sc.sud = SC_SUD
sc.sul = SC_SUL
sc.sum = SC_SUM
sc.tit = SC_TIT
sc.ttl = SC_TTL
sc.wrd = SC_WRD
sc.wyr = SC_WYR
sc.wti = SC_WTI
sc.wau = SC_WAU
sc.wut = SC_WUT
sc.wur = SC_WUR
sc.wnc = SC_WNC
sc.wfm = SC_WFM
sc.wtp = SC_WTP
sc.wgo = SC_WGO
sc.wct = SC_WCT
sc.wid = SC_WID
sc.wsd = SC_WSD
sc.ntl = SC_NTL
sc.wis = SC_WIS
sc.wst = SC_WST
sc.wuc = SC_WUC
sc.wucx = SC_WUCX
sc.wuco = SC_WUCO
sc.wno = SC_WNO
sc.wln = SC_WLN
sc.wpu = SC_WPU
sc.wpl = SC_WPL
sc.wsrs1 = SC_WSRS1
sc.wsrs2 = SC_WSRS2
sc.wga = SC_WGA
sc.wsu = SC_WSU
sc.wsm = SC_WSM

SUNCAT open data

First problem: getting permission from contributing libraries to allow their data to be re-distributed.  Fortunately for me that’s not my problem, and some sterling work from other members of the team has allowed some data to be released without strings.

Libraries who allow some of their data out into the wild usually have a stipulation that it can be any record they’ve contributed that doesn’t originate from such-and-such source, or has been created by them, or similar.

In practice, this means using records from particular libraries that have a particular library code in 040$a, or that don’t have a particular code in 035$a.  These types of rules could be applied automatically at a live filtering stage, but in order to be utterly sure nothing untoward is being released we have chosen to extract those records and build a separate database from them alone.
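A minimal sketch of what such a rule looks like over MARC-XML follows. The rule shapes (a required cataloguing code in 040$a, a forbidden source prefix in 035$a) follow the description above, but the specific codes used here (StEdNL, the (OCoLC) prefix) are purely illustrative, not the actual per-library agreements:

```python
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"  # MARC-XML (MARC21 slim) namespace

def subfield_values(record, tag, code):
    """All values of subfield `code` in datafields with the given tag."""
    return [
        sf.text or ""
        for df in record.findall(f"{{{MARC}}}datafield[@tag='{tag}']")
        for sf in df.findall(f"{{{MARC}}}subfield[@code='{code}']")
    ]

def releasable(record, required_040a=None, forbidden_035a_prefix=None):
    """Apply one library's release rule to a single MARC-XML record.

    The actual per-library values are hypothetical; only the rule shapes
    (check 040$a, check 035$a) come from the post.
    """
    if required_040a and required_040a not in subfield_values(record, "040", "a"):
        return False
    if forbidden_035a_prefix and any(
        v.startswith(forbidden_035a_prefix)
        for v in subfield_values(record, "035", "a")
    ):
        return False
    return True

# Illustrative record (not real SUNCAT data).
rec = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="040"><subfield code="a">StEdNL</subfield></datafield>'
    '<datafield tag="035"><subfield code="a">(OCoLC)12345</subfield></datafield>'
    "</record>"
)
print(releasable(rec, required_040a="StEdNL"))          # passes the 040$a rule
print(releasable(rec, forbidden_035a_prefix="(OCoLC)")) # fails the 035$a rule
```

In the real pipeline the records that pass are extracted into the separate open-data database, rather than filtered live.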

So, once you get past the problem of libraries allowing their data to be distributed freely (which we haven’t 😉 ) you then need to allow clients to usefully connect and retrieve the data.  Two approaches are being taken for this.

The first is to produce an SRU target onto the database of (permitted) records.  We have a lot of experience with Index Data’s open-source Zebra product, which is a database and Z39.50/SRU frontend all in one.  It can be quite fiddly to configure (which is where the experience comes in handy!) but its performance (speed and reliability) is excellent.  It also allows multiple output formats for the records using XSLT.

One of the most useful outcomes from the Linked Data Focus project was an XSLT produced by Will Waites that converts MARC-XML into RDF/XML.  We can use this as one of the outputs from the SRU target, alongside MARC-XML (although some libraries require that their records not be released in MARC-XML, in which case the XSLT simply blanks those records when they are requested in that format) and a rudimentary MODS transformation; a JSON transformation might be a possibility too.

Perhaps more usefully for the RDF/XML data, the second approach is to feed these into a SPARQL endpoint.  This should allow anyone interested in the linked data RDF to query in a language more familiar to the linked data world.
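Once the endpoint is published, querying it should look something like the sketch below. The endpoint address here is a placeholder (the real address isn’t announced in this post), and the query deliberately uses generic triple patterns rather than guessing at the predicate names used by the MARC-to-RDF transform:

```python
from urllib.parse import urlencode

# Placeholder address -- substitute the published SUNCAT SPARQL endpoint.
SPARQL_ENDPOINT = "http://example.org/suncat/sparql"

# A generic SPARQL query over the RDF records: list a few triples to show
# the request shape without assuming any particular vocabulary.
query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

# Most SPARQL endpoints accept the query in the 'query' parameter of a GET.
request_url = SPARQL_ENDPOINT + "?" + urlencode({"query": query})
print(request_url)
```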

We’ll be providing more information on how to connect to the SRU target and the SPARQL endpoint once we’ve polished them up a bit for you.