Minting DOIs for research data in the UK

‘Coin press at New Orleans Mint Museum’ 
Attribution, No Derivative Works. Some rights reserved by Ted Drake.

Last week’s DataCite workshop was a really good opportunity to ask questions about DataCite at The British Library, how to mint a DOI (Digital Object Identifier), and to discuss challenges with citing research data.

Data Citation

The day started with a challenge to the presenters – what is data? This discussion had echoes of KAPTUR’s own research question – what is visual arts research data? (Environmental Assessment report). It seems almost impossible to define research data due to its diversity, but a working definition is obviously necessary; a good example is the one in the University of Bristol’s Glossary.

The British Library’s Head of Scientific, Technical & Medical Information, Lee-Ann Coleman, spoke about the importance of making research data available, mentioning examples including the virologist Ilaria Capua, who opened up worldwide access to avian flu virus sequence data, and the open-data journal GigaScience’s research into E. coli. A recent addition, ISO 26324:2012 for DOIs, was also mentioned. Garfield’s 15 reasons ‘when/why to cite?’ was a useful point of reference too:

  1. Paying homage to pioneers.
  2. Giving credit for related work (homage to peers).
  3. Identifying methodology, equipment etc.
  4. Providing background reading.
  5. Correcting one’s own work.
  6. Correcting the work of others.
  7. Criticizing previous work.
  8. Substantiating claims.
  9. Alerting researchers to forthcoming work.
  10. Providing leads to poorly disseminated, poorly indexed, or uncited work.
  11. Authenticating data and classes of fact – physical constants, etc.
  12. Identifying original publications in which an idea or concept was discussed.
  13. Identifying the original publication describing an eponymic [sic] concept or term as, e.g., Hodgkin’s disease, Pareto’s Law, Friedel-Crafts Reaction, etc.
  14. Disclaiming work or ideas of others (negative claims).
  15. Disputing priority claims of others (negative homage).

Garfield, E., 1996. When to Cite. In: Library Quarterly 66 (4), 449-458. Available from: http://www.garfield.library.upenn.edu/papers/libquart66(4)p449y1996.pdf [Accessed 25 May 2012].

What is DataCite? 

Elizabeth Newbold provided an introduction to DataCite, a not-for-profit international registration agency for DOIs that aims to facilitate the citing of research data. Founded in December 2009, it consists of a Managing Agent (currently the German National Library of Science and Technology (TIB)) and regional Members. In the UK, The British Library is the regional Member, which works with ‘Data Clients’ such as the UK Data Archive and other data centres and repositories. DOIs are assigned between the regional Member (e.g. The British Library) and its Data Clients (e.g. UK Data Archive), i.e. on an institution-to-institution basis. An individual researcher who wants a DOI needs to contact the appropriate Data Client for their subject discipline; a list of some existing and potential future Data Clients is maintained on the DataCite website. Data Clients must fulfil a number of requirements and pay an annual fee to The British Library.

Some of the requirements for Data Clients:

  • DOIs must resolve to a publicly accessible landing page even if the data itself is not open; the landing page can be an existing set of Web pages with the Data Client’s style so long as it is updated to include the DataCite information.
  • Mandatory metadata fields: 4 fields (5 if you include the DOI itself) – these should be subject discipline agnostic: http://schema.datacite.org/
  • The mandatory metadata must be freely available for discovery purposes, specifically under a Creative Commons CC0 licence; there was some interesting discussion around this and some issues to be resolved.
  • Data Clients should have a formal data preservation plan (this may include disposal policies and so on); an operational service level agreement (SLA); and a clear intention in a mission statement to preserve and maintain the DOIs, which could include reference to an EPSRC Roadmap. Action: DataCite will share a draft SLA with the attendees.

How to mint a DOI – case study

Louise Corti of the UK Data Archive provided a very useful mini-case study and I’ll link to her presentation here when it is available. As a data provider, the UK Data Archive wants to use citations to improve resource access and discovery. It was really interesting to hear how DOIs are affected by changes to the research data. At the UK Data Archive, minor changes (e.g. a spelling mistake or typo) are documented in their Change Log but the DOI version number stays the same; major changes (such as an updated dataset) are documented in the Change Log field and the DOI is also given a new version number at the end. Challenges for the future include citing parts or fragments of research data, and also issues around describing relationships between data. Look out for a forthcoming UK Data Archive and ESRC brochure on citing data, aimed at the Social Science community.
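The minor/major rule described in the case study can be sketched in a few lines of Python. This is only an illustration of the idea, not the UK Data Archive's actual system: the `DatasetRecord` class, the DOI stem `10.1234/example-0001` and the `-N` suffix format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical record: a DOI stem plus a version suffix and a change log."""
    doi_stem: str                       # made-up DOI without the version suffix
    version: int = 1
    change_log: list = field(default_factory=list)

    @property
    def doi(self) -> str:
        # Assumed format: the version number is appended at the end of the DOI.
        return f"{self.doi_stem}-{self.version}"

    def record_change(self, description: str, major: bool) -> str:
        """Log every change; only a major change bumps the DOI version."""
        if major:
            self.version += 1
        kind = "major" if major else "minor"
        self.change_log.append((self.version, kind, description))
        return self.doi


record = DatasetRecord("10.1234/example-0001")
print(record.record_change("Corrected a typo in the description", major=False))
print(record.record_change("Added an updated dataset", major=True))
```

Both changes end up in the change log, but only the second one produces a DOI with a new version number.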

How to mint a DOI – the technical bit

An illuminating presentation from Ed Zukowski described the following components of the DataCite systems:

The Data Client will be provided with information from the regional Member in order to make use of the Metadata Store and the facility to mint DOIs; technical knowledge is required to use the API for bulk registration. To mint a single DOI, an XML file is required containing at least the four mandatory metadata fields from the DataCite Schema.
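As a rough sketch of what such an XML file might contain, the Python below builds a minimal record using only the standard library. It rests on assumptions that should be checked against http://schema.datacite.org/ before use: that the four mandatory fields are creator, title, publisher and publication year (plus the DOI as identifier), and that the namespace is the kernel-2.2 one. The function name `build_datacite_xml` and all the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: schema namespace for the kernel version current at the time;
# confirm the exact value at http://schema.datacite.org/ before use.
NS = "http://datacite.org/schema/kernel-2.2"


def build_datacite_xml(doi, creators, title, publisher, year):
    """Build a minimal metadata record: the DOI plus four mandatory fields."""
    ET.register_namespace("", NS)       # serialise with a default namespace
    root = ET.Element(f"{{{NS}}}resource")

    identifier = ET.SubElement(root, f"{{{NS}}}identifier", identifierType="DOI")
    identifier.text = doi

    creators_el = ET.SubElement(root, f"{{{NS}}}creators")
    for name in creators:
        creator = ET.SubElement(creators_el, f"{{{NS}}}creator")
        ET.SubElement(creator, f"{{{NS}}}creatorName").text = name

    titles_el = ET.SubElement(root, f"{{{NS}}}titles")
    ET.SubElement(titles_el, f"{{{NS}}}title").text = title

    ET.SubElement(root, f"{{{NS}}}publisher").text = publisher
    ET.SubElement(root, f"{{{NS}}}publicationYear").text = str(year)
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
print(build_datacite_xml(
    doi="10.1234/example-0001-1",
    creators=["Example, Researcher"],
    title="Example visual arts dataset",
    publisher="Example Data Centre",
    year=2012,
))
```

A real submission would also need to validate against the published schema and go through the Data Client's account with the regional Member; neither step is shown here.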

The user will resolve a DOI (e.g. using a system such as http://dx.doi.org/) through the Global Handle Registry, which includes information from the Handle Server hosted by the DataCite Managing Agent. Resolving a DOI takes the user to a landing page, and statistics are collected about how many times a DOI has been resolved.

There is a free search of existing DataCite DOIs. From the top right of the Search page select ‘Options’ and ‘enable’ the Filter Preview, then when you do a search it is possible to filter by individual regional Member (‘allocator’) and Data Client (‘datacentre’).

The OAI-PMH Data Provider is available here: http://oai.datacite.org/

http://data.datacite.org/ – provides two ways of exposing metadata held in the Metadata Store:

  1. HTML links i.e. hyperlinks in a standard Web browser.
  2. HTTP Content Negotiation – ‘I say what I want and in what priority’ e.g. ‘I want a PDF version of the research data but if there is a HTML version I’ll take it’ – if there is a PDF version available content negotiation will take you straight to the PDF rather than to the landing page for example.
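The ‘I say what I want and in what priority’ idea behind content negotiation can be illustrated with a small Python sketch of how a server might read an HTTP `Accept` header and pick a representation. This is a simplified illustration of the general mechanism, not DataCite's implementation: the function names are made up, and partial wildcards such as `text/*` are deliberately not handled.

```python
def parse_accept(header):
    """Return (media_type, q) pairs, most preferred first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0                          # default weight when no q is given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0              # treat an unreadable weight as "not wanted"
        prefs.append((media_type, q, position))
    # Higher q first; ties keep the order in which the client listed them.
    prefs.sort(key=lambda item: (-item[1], item[2]))
    return [(media_type, q) for media_type, q, _ in prefs]


def choose_representation(header, available):
    """Pick the client's most preferred media type that is actually available."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            return available[0]
    return None


# "I want a PDF, but if there is only an HTML version I'll take it."
accept = "application/pdf, text/html;q=0.5"
print(choose_representation(accept, ["text/html", "application/pdf"]))
print(choose_representation(accept, ["text/html"]))
```

The first call picks the PDF because it carries the higher weight; the second falls back to HTML because no PDF is on offer.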

Contact datasets@bl.uk to ask for access to the test site which enables you to mint ‘temporary’ DOIs. See also: https://github.com/datacite

A really useful tool to format DOIs into Harvard system citations (and other citation systems) in multiple languages: http://crosscite.org/citeproc/

Breakout groups on challenges with citing research data (some questions):

  • Selection process – what about raw data? when does data become citable?
  • Why not use DOIs for Ph.D. theses?
  • Do you need to mint DOIs before you publish the journal article so you can link to them? – could start minting DOIs at collection level then move into additional specific parts nearer to publication of the journal article?
  • A need to define roles and responsibilities.
  • What about changes to Data Clients or funding bodies?
  • How does versioning work with DOIs? (note UK Data Archive case study above)
  • What is a citable unit of research data?
  • What about cross-institutional, international, or cross-disciplinary research? Who mints the DOIs?
  • A need for DataCite to provide case studies, perhaps with future workshops.
  • It is only possible to describe one resource type per DOI (and this is a fixed controlled list e.g. Image, Film, etc) – this may be problematic with visual arts e.g. an exhibition; how do you describe complex relationships?

For cost/charge plans – discuss with the UK regional Member via datasets@bl.uk

The next DataCite workshop will be on metadata on Friday 6th July; details will be published online in due course.


Meeting at Goldsmiths College, University of London

Goldsmiths College, University of London.
Photo: MTG

All the members of the project team met at Goldsmiths College, University of London last week in order to collaborate on two aspects of the project: the development of the RDM policies, and the promotion of KAPTUR.

Key points:

  • The Project Manager is working with Angus Whyte and Andrew McHugh from the Digital Curation Centre on two small events: Selecting and Appraising Research Data; and Using the OAIS Functional Model.
  • The Project Officers are each raising awareness of the importance of effective Research Data Management at their institutions. Developments are quite diverse across the four institutions, but the KAPTUR project represents an opportunity to learn from each other and from the nuances of each institution.
  • The Technical Manager will be installing DataFlow’s DataStage onto his laptop so we can have a demonstration at the next meeting.
  • The Project Officers and Project Manager are working on publicity to further promote the Environmental Assessment report.
  • The Project Director reported that KAPTUR’s abstract for the Digital Humanities Congress 2012, University of Sheffield, had been successful.

Building a pilot demonstrator service for the visual arts

The following blog post is adapted from the Conclusion and Recommendations section of the Technical Analysis report (PDF):

The KAPTUR Technical Manager investigated 17 different types of software which were compared to the requirements of the four partner institutions (details and appendices in the report). The next stage of the research reduced the choice of software to five options: DataFlow, DSpace, EPrints, Fedora, Figshare. These were all found to be suitable for managing research data in the visual arts; through a further selection process EPrints, Figshare, and DataFlow were identified as the strongest contenders.

[…] it is recommended that two pilots occur side by side: an integration of EPrints with Figshare and a separate piece of work linking DataFlow’s DataStage with EPrints. By integrating EPrints with Figshare, the project can take advantage of a system which has been built with, and for, researchers to handle research data specifically, and has a user-friendly visual interface (which is constantly evolving and enhanced by Figshare directly). […] By integrating DataStage with EPrints the research data storage and software will be hosted within each institution, providing them with better control over the type of data that can be stored, published and managed. The integration will also enable content uploaded in DataStage to be securely backed up by the institution and accessible from anywhere in the world. A ‘Dropbox’-like tool is featured in the latest beta version, providing a user-friendly interface which will benefit visual arts researchers. EPrints will effectively provide the role of DataFlow’s DataBank.


Kaptur – seven months into the project (7/18)

This is our update for the end of the seventh month:

WP1: Project Management

WP3: Technical Infrastructure

WP4: Modelling

  • University of the Arts London (UAL) held its first Research Data Management (RDM) working group meeting on Tuesday 10th April; the Kaptur RDM discussion paper was amended for UAL’s use and went forward to their Research Standards and Development Committee on 1st May.

WP7: Dissemination

  • The Project Officers were asked to suggest three ways to increase the profile of the project, including: an internal event, an internal website/newsletter/email, and something innovative.
  • The GSA Project Officer has had some Web training with a view to adding information about the project to the Research Web pages; Kaptur has already been promoted on the GSA Facebook page.
  • The Goldsmiths Project Officer has given two presentations about the project on 26 April and on 2 May for Library and Research Office staff; an item will also be in the May edition of the Goldsmiths Research newsletter.
  • The UCA Project Officer gave a presentation at the Staff Research Conference about the challenges and opportunities of running an institutional repository which included information on the Kultur, Kultivate, eNova and Kaptur projects; a targeted email will be sent to faculty librarians to provide more information about Kaptur.
  • The UAL Project Officer will submit a paper for the UAL Information Services Staff Conference (September), and an article for the Library Services Newsletter.
  • Watch this space for more creative dissemination ideas; several are in discussion including events, videos, music and artwork!

4. Issues/challenges

In keeping with the original vision of the Principal Investigator (Project Director), the collaborative aspect of KAPTUR is working really well; at this month’s meeting in particular we were able to learn about and reflect on the different approaches each institution is taking to the modelling workpackage.

The challenge this month has been to select the system for the KAPTUR pilot technical infrastructure. The research method led to a short-list of five systems, all of which were similar in ‘score’ based on the user requirements, so an additional selection process had to be applied. A blog post will be forthcoming about this.


Research Data Hack Day in Manchester

Graham Plumb. 2000. Computer-Related Design. Photo: Dominic Tschudin. Collection: Royal College of Art Photographic Record of Student Work, 1960-2002.
© Royal College of Art

The following blog post is by Carlos Silva, Technical Manager for Kaptur:

The Hack Day started with quick presentations from attendees so that we could find out about each other’s projects and interests, pose questions, and start assembling teams who shared similar ideas, ambitions and problems.

By the end of the afternoon we had each been allocated a team and a task, and had started working on a particular problem.

There were four teams which covered the following topics:

  1. Stakeholder Driven Metadata
  2. Dropbox for Institutions
  3. SWORD 2 protocol and Bit Torrent
  4. Data collection from research activities

1. Stakeholder Driven Metadata

Using a metadata map we were trying to map different schemas such as Dublin Core with OAI-PMH and the British Library.

Looking at this from a user’s perspective, users will need to follow a certain workflow, for example using a DMP and so on (N.B. view prezi about this).

The team also worked on an example to show different types of handling DOIs and metadata between different schemas: http://homes.ukoln.ac.uk/~ab318/datacite/

I mentioned that the Kaptur project involves creating a model of best practice in the management of visual arts research data, and that using different types of metadata schemas was a problem for some institutions. I also mentioned that researchers in our sector need to handle not only large amounts of data but also different types of data, along with metadata schemas and fields that may not be covered by the default Dublin Core or OAI-PMH schemas.

Finally there was an unofficial launch of the Journal of Open Research Software: http://openresearchsoftware.metajnl.com

2. Dropbox for Institutions

Sparkleshare was mentioned during the presentation, but it was noted that it is too unstable for use in production environments.

A blogpost is available here with more information: http://blogs.bath.ac.uk/research360/2012/05/mrd-hack-days-file-backup-sync-and-versioning-or-the-academic-dropbox/

3. SWORD 2 protocol and Bit Torrent

SWORD 2 is a protocol for depositing content and its metadata with a repository.

The issue for this group to discuss was how to enable any type of file to be deposited.

Big deposits can take a long time to transfer; this isn’t a problem in itself, but there are problems around it. For example, you can do partial uploads, but if the transfer is interrupted the repository will not be able to create a record.

Using SWORD and Bit Torrent, the team tried to tackle the problem by splitting the file into chunks, which allows large files to be submitted and uploaded to the server despite interruptions.

Advantages could be found immediately: it is secure, you can track it and also limit the number of uploads.
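The chunking idea can be sketched in Python without any network code. This is not the team's implementation and it uses neither the SWORD 2 nor the Bit Torrent APIs; it only shows the principle of splitting a file into hashed chunks so that an interrupted transfer can be resumed by re-sending just the missing pieces. All function names, and the `received` dictionary standing in for the server's state, are hypothetical.

```python
import hashlib


def split_into_chunks(data: bytes, chunk_size: int):
    """Split data into (index, sha1_hex, bytes) chunks of at most chunk_size."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        chunks.append((len(chunks), hashlib.sha1(piece).hexdigest(), piece))
    return chunks


def missing_chunks(chunks, received):
    """List the indexes the server still lacks, or holds in corrupted form.

    `received` is a stand-in for server state: it maps index -> bytes.
    """
    return [
        index
        for index, digest, _ in chunks
        if index not in received
        or hashlib.sha1(received[index]).hexdigest() != digest
    ]


def reassemble(chunks, received):
    """Rebuild the file once every chunk has arrived intact."""
    outstanding = missing_chunks(chunks, received)
    if outstanding:
        raise ValueError(f"still waiting for chunks: {outstanding}")
    return b"".join(received[index] for index, _, _ in chunks)


data = bytes(range(256)) * 10                  # 2560 bytes of sample content
chunks = split_into_chunks(data, 1000)

# Simulate an interrupted transfer: the middle chunk never arrived.
received = {0: chunks[0][2], 2: chunks[2][2]}
for index in missing_chunks(chunks, received):
    received[index] = chunks[index][2]         # resend only what is missing

assert reassemble(chunks, received) == data
```

Only once every chunk is present and matches its hash is the file reassembled, which is the point at which a repository could safely create the record.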

This project won support for further enhancement and will receive two days of development time paid for by JISC.

4. Data collection from research activities

The concept was straightforward: when people start to upload content, information will come not only from the users, but also from the actual file itself.
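A minimal Python sketch of that concept: combine what the user types in with what can be read from the file itself. This is not the API the team was building; `describe_file` and the field names are hypothetical, and only details available from the standard library (name, size, guessed media type, checksum) are derived.

```python
import hashlib
import mimetypes
import os
import tempfile


def describe_file(path, user_metadata):
    """Merge metadata derived from the file with metadata supplied by the user."""
    with open(path, "rb") as handle:
        checksum = hashlib.sha256(handle.read()).hexdigest()
    derived = {
        "filename": os.path.basename(path),
        "size_bytes": os.stat(path).st_size,
        "media_type": mimetypes.guess_type(path)[0],   # None if unknown
        "sha256": checksum,
    }
    record = dict(derived)
    record.update(user_metadata)    # design choice: user-supplied values win
    return record


# Example with a throwaway file and made-up user input.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello")
    example_path = tmp.name

print(describe_file(example_path, {"title": "Sketchbook notes"}))
os.remove(example_path)
```

Letting user-supplied values override derived ones is just one possible policy; a real service might prefer to keep both and flag the conflict.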

The team attempted to build an API to do this; however, more time was needed to complete it.

Ultimately the project was intended to produce a very big feed recording everything that has been done around a record, such as visits by a researcher or modifications to the file, so that all of that information could be gathered by the System Admin to create reports.