Why do JISC MediaHub metadata need to be enhanced?
JISC MediaHub contains around 130,000 images, videos and audio clips licensed by JISC Collections, plus records of another 600,000 harvested from other providers.
A large proportion of these records contain important information relating to people, places and dates. Although some of these have been well catalogued, all too often the information exists only as descriptive plain text. This can happen for a variety of reasons: the metadata may have been created from older, text-based records that were never designed with machine indexing in mind; resources for cataloguing may have been limited; or an ill-conceived merge into a common metadata schema may have destroyed valuable elements.
Whatever the cause, the discoverability of records is reduced. Advanced searching and results filtering based on geographic location or date will fail to retrieve relevant records whose location and date are not indexed. Visualizations such as maps and timelines are likewise restricted to covering only a subset of all records.
It can also be very confusing to users, and dent their confidence in the service, if they see a date prominently displayed in the title or description of a record, and yet a subsequent search for that date fails to retrieve that record.
Methods for enhancing metadata
There are three obvious approaches that could be used:
- Have the records professionally catalogued.
- Use text processing software to parse the metadata and attempt to identify dates, locations and the names of people and places etc.
- Crowd sourcing: a large community of users is likely to contain individuals with the necessary knowledge to contribute information.
The first approach, professional cataloguing, is unfortunately an extremely expensive option.
The second approach, using text processing software, is an efficient method but unfortunately is prone to error. Some information may be missed, and some terms may be falsely marked up. This can make the resulting records confusing to ordinary users who, reasonably enough, expect cataloguing to be definitive rather than a probabilistic expression of what a record is likely to be about.
We think the third approach, crowd sourcing, has great potential with an academic user community. Compared to an average web site, our users are likely to include an unusually large number of experts, and of individuals who are motivated to share knowledge. This approach involves certain risks: if the process for contributing is complex or time-consuming, fewer users will take part; user-contributed information cannot be assumed to be as trustworthy as “official” metadata, and is prone to malicious or frivolous use; and whatever review processes exist must be lightweight (otherwise we might as well do all the cataloguing ourselves).
We will combine two of the approaches described above: text processing software and crowd sourcing.
By processing metadata, we can identify candidate elements for indexing. These will be treated as unreliable, and will not be indexed at this stage. To begin with, we will focus on location information, and use EDINA Unlock (http://unlock.edina.ac.uk/texts/introduction) for text processing.
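The extraction step can be sketched as follows. This is a simplified stand-in for a geoparsing service such as EDINA Unlock, not its actual API: a real geoparser uses NLP and full gazetteer lookups, whereas the toy gazetteer here only illustrates the data flow, with every match stored as an unverified candidate rather than indexed directly.

```python
# Toy gazetteer standing in for a real geoparsing service's place-name data.
TOY_GAZETTEER = {"Edinburgh", "Glasgow", "London"}

def extract_candidates(description: str) -> list[dict]:
    """Return candidate location elements from free text, marked unverified."""
    candidates = []
    for token in description.replace(",", " ").replace(".", " ").split():
        if token in TOY_GAZETTEER:
            candidates.append({"type": "location",
                               "value": token,
                               "status": "candidate"})  # not indexed yet
    return candidates

record = {"id": "mh-001",
          "description": "Street scenes filmed in Edinburgh and Glasgow, 1951."}
print(extract_candidates(record["description"]))
# two candidates: Edinburgh and Glasgow, both awaiting user confirmation
```

The key design point, as described above, is that candidates carry a status distinct from catalogued metadata, so nothing probabilistic leaks into the index before users have confirmed it.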
When we have candidate location elements for our records, we will present them to users in the JISC MediaHub user interface. This will be incorporated into the existing record display pages, so users find the new functionality as part of their normal usage of the service. Users will be asked simply to confirm or reject each candidate element. This should make the process of contributing very simple, and help to maximize users’ involvement. Also, since users are selecting from predefined options rather than entering their own values, it should reduce the scope for abuse. We intend that no complicated reviews should be needed, at least in the vast majority of cases, and that instead we can base the result on a poll of users’ opinions.
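One way the confirm/reject poll could be tallied is sketched below. The vote thresholds are illustrative assumptions, not values the project has fixed: a candidate needs a minimum number of votes and a clear majority before it is confirmed or rejected, and otherwise stays open.

```python
def poll_verdict(confirms: int, rejects: int,
                 min_votes: int = 3, min_ratio: float = 0.75) -> str:
    """Decide a candidate element's fate from user votes.

    Confirm when enough users have voted and a clear majority agree;
    reject on a clear majority against; otherwise keep polling.
    Thresholds are illustrative assumptions.
    """
    total = confirms + rejects
    if total < min_votes:
        return "pending"
    if confirms / total >= min_ratio:
        return "confirmed"
    if rejects / total >= min_ratio:
        return "rejected"
    return "pending"

print(poll_verdict(4, 0))  # confirmed
print(poll_verdict(1, 3))  # rejected
print(poll_verdict(2, 1))  # pending: majority not clear enough
```

Because the verdict is a pure function of vote counts, no editorial review step is needed in the common case, which keeps the process lightweight in the sense described above.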
When we have a corpus of new metadata, we can index these values to enrich the records. We do not intend that the new metadata should be merged with the original records, as the provenance is important; however, within the user interface we can provide the option to include user-contributed location data in searches.