Day 2, Seventh RDMF, University of Warwick

Scarman House, University of Warwick. Image used with permission of Warwick Conferences.

RDMF7 had a great venue, good food and good company; limiting attendance to 50 worked well for encouraging discussion. Day 2 is summarised below, with useful links and key points.

Impact through data management: Where are the wins? What are the pitfalls? – Cameron Neylon, Senior Scientist, Science & Technology Facilities Council

Cameron spoke about motivation from his own perspective as a researcher: there is motivation to spend time on a publication, but not necessarily on preparing a dataset. The motivation for good data management practice might be that it makes material easily available when the researcher needs to write a paper.

A dataset needs to be clearly associated with the research questions, and needs to record how it can be discovered/re-used.

At the moment there is a focus on data, but there are other aspects we need to consider as well, including process, software and materials.

The funders’ role as motivators. A myth or achievable reality? – Ben Ryan, Senior Evaluation Manager, EPSRC

EPSRC timetable for building RDM capacity: organisations should have a roadmap in place by May 2012, and should be compliant by May 2015.

Data sharing agreements can cover copyright and other issues, but enough information should be published to describe the data, any limitations on it, and how it can be accessed.

EPSRC sets a deadline of 12 months after data creation; they then expect access to both physical and digital data for a minimum of 10 years from the last date on which access to the data was requested by a third party, or from the date that any researcher’s ‘privileged access’ period expires.

Institutional measures to encourage and facilitate effective data management and sharing. A matter of cash, careers or cultural change? – Miggie Pickton, Research Support Librarian, University of Northampton

When Miggie began the DAF investigation, little was known centrally at the University about data storage policy or procedure. They already had NECTAR, an EPrints repository, in place to store and preserve digital outputs; a future development of the RDM work may be to create another EPrints repository for research data, with links between research outputs and research data as appropriate.

They have established a Research Data Working Group comprising the University Records Manager, a representative researcher, the Head of Research and Enterprise, and Miggie.
One of the issues with managing research data was selection and disposal – some researchers were reluctant to set a disposal date – ‘after I die’.
The Northampton approach was about encouragement rather than mandate.
Simplified internal procedures have been set up to monitor whether policy and procedures are being followed.

RDM is now a standard part of research student inductions. They also focus on dissemination via multiple communication channels e.g. school research forums, university website, one-to-ones; they involve records managers, library staff and researchers in development of training sessions and guidelines; gain support from opinion leaders (as well as senior managers) raising awareness amongst academic and support staff. They demonstrate the link between good RDM and career progression e.g. through increased data citation.

Where next? – Disseminate new policy to all schools and divisions; develop RDM training programme; and provide a storage facility.

Benefits analysis: challenges and opportunities – Neil Beagrie, Director of Consultancy, Charles Beagrie Ltd.

Neil clarified what he meant by ‘long term benefits’: benefits in the near term could be up to 5 years; in the long term, more than 5 years.

It should be noted that not all data downloads equate to data use. For example, a teacher may download a dataset once, add it to their Virtual Learning Environment and thereby make it available to 50 students, yet the usage stats record just the one download. There are also examples of users downloading an item once and then making intensive use of it for three years. Other users browse and download but then reject or don’t use an item – a discard is not necessarily negative, as this can be part of the learning process. Another question is whether to count a download or an ‘access’, for example someone running many queries without ever downloading – all of which shows the complexity of measuring use.

Break-out sessions

Group 1: The drive for effective research data management – is it much ado about nothing?

The group decided straight away that it was not much ado about nothing; the question was how to motivate people to do RDM.
Some of the positive suggestions were: to embed principles and practices early on so they become part of the research lifecycle; to appeal to people’s self-interest; to have good stories; to offer practical tips; to make sure roles and responsibilities are clearly defined, as ownership is needed at lower levels as well as at senior management level; and to have infrastructure in place to make it all possible.
The group also considered the costs of data management, and whether there was a role for funders in making a positive example of institutions that are already meeting their requirements.

Group 2: What really are the sticks and carrots that will make a long-term difference to the pursuit of structured data management processes?

Ref: Paul Stainthorp’s blog post includes Group 2 feedback.

Group 3: Who pays and who reaps the benefit? The incentives for funders, institutions and researchers for investment in research data management.

  1. Research Funders to make funding available for research based solely on existing datasets – gives incentive to researchers.
  2. DCC to include a costing module in the DMP Online tool to allow sensible estimates of cost.
  3. Research Funders to be explicit that RDM costs should be included in applications.
  4. Publishers to review peer-review process to include validation of data (some comments about the peer-review process being slow enough as it is without including datasets as well; some disciplines already have measures in place to peer-review datasets).
  5. Research Institutes to provide training and support to researchers from early stages, in order to encourage best practice.
  6. Professional bodies to promote good RDM through policy and training.
  7. DCC to collect examples of the costs of bad RDM: institution’s/researcher’s reputation; financial costs; even the loss of lives (example given of a cancer patient study). Liz Lyon suggested that this could be a piece of work titled ‘Reputation and Risk’.

Successes in sharing: obtaining data from more than 1,000 sources worldwide – Catherine Moyes, Malaria Atlas Project Manager, University of Oxford

Malaria Atlas Project: http://www.map.ox.ac.uk/

The MAP team collate data and generate geo-spatial models; they are working across five continents with very large datasets. They work with four different data types: mosquito occurrence data points, genetic data points, parasite prevalence, and case incidence data points. One survey equals one data point; the data points are geo-positioned which adds value to the data.

Catherine showed the following table of ‘carrots versus sticks’:
  Carrots                      Sticks
  direct funding               calling upon the journal
  applications for funding     calling upon the original funder
  co-authorship                calling upon the institution
  citation                     –

Incentives you could offer to researchers include: to pay them money for data cleaning (rather than for the data itself); to offer services such as application writing, or providing a letter of support; to offer to write a joint paper; and to cite the researcher’s data.

The MAP project has not used any of the above incentives, as this would have been impossible when obtaining data from over 1,000 sources. Rather than providing incentives, it is also possible to remove disincentives, which MAP does by:

- explaining clearly and succinctly who they are and what they are doing;
- being precise about exactly what data they are requesting;
- if a request is linked to a paper, reading the paper carefully before making the request;
- being complimentary and diplomatic;
- being persistent – not to the extent of ‘badgering’, but politely asking again and again;
- taking on any work required to sort the data out (and not expecting people to do anything more than email the data to MAP);
- providing requested undertakings about use of the data;
- publicly acknowledging all data providers online;
- having data requests made by a senior team member, e.g. a professor writing to them as a peer, in a complimentary manner.

It also helps that the project began in 2005 and has built up a good reputation. The 1,000+ sources worldwide include a wide variety of groups – ministries of health, non-governmental organisations, public health reports, journal articles and academic researchers – but MAP has used the same approach with all of them. A permanent feature of their website is the ‘acknowledgements’ section, which names all contributors; there is also an acknowledgements section within each individual dataset record.

Catherine posed the question ‘why share raw data?’; as well as the usual answers, she also mentioned the book Sharing Publication-Related Data and Materials.

There were some interesting slides about the requests they have handled to access their data. Their approach now is to release the relevant data alongside each paper MAP publishes. One example of sensitive data that they can’t release is from Myanmar (Burma), as including the location information of people would put lives at risk. Of all the people they asked for permission to release data, only two didn’t want to be cited; one contact was happy for their data to be released but didn’t want to be contacted about it; and only one person said ‘no’.

There are no registration or access agreements as they want to encourage use and therefore get rid of barriers.

They have a PostgreSQL database behind the scenes; data is downloaded as Comma Separated Values (CSV) files. A composite citation is included in the resulting spreadsheet, with up to three different citations possibly relating to one row of data. It should be noted that the project does not ask people to cite ‘MAP’.
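A composite citation field of this kind might be assembled at export time along these lines – a minimal sketch, assuming a hypothetical row structure and field names (this is illustrative, not MAP’s actual schema or code):

```python
import csv
import io

# Hypothetical example rows: each data point can carry up to three
# source citations, which are joined into one composite field on export.
rows = [
    {"site": "Site A", "prevalence": 0.12,
     "citations": ["Smith et al. 2007", "MoH report 2008"]},
    {"site": "Site B", "prevalence": 0.05,
     "citations": ["Jones et al. 2009"]},
]

def to_csv(rows):
    """Write rows to CSV text, collapsing citations into one column."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["site", "prevalence", "citation"])
    writer.writeheader()
    for r in rows:
        writer.writerow({
            "site": r["site"],
            "prevalence": r["prevalence"],
            # join up to three citations into a single composite field
            "citation": "; ".join(r["citations"][:3]),
        })
    return buf.getvalue()

print(to_csv(rows))
```

The point of the design is that provenance travels with the data: every exported row names its original sources, so downstream users cite the contributors rather than MAP itself.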

data – no terms and conditions apply
software – GitHub, GNU Public Licence (open source) and free
probability distributions – email them and they’ll send a DVD, as the files are too big to download
GIS surfaces – Creative Commons licence and free
estimates of burden/populations at risk – no terms and conditions apply
map images – Creative Commons licence and free

A member of the audience asked Catherine about sustainability. She mentioned that data collection hasn’t stopped, and that they hope the release of the data may encourage this further; ultimately, though, it depends on how long people remain interested in the data and continue downloading it.

The Institutional Data Management Blueprint and incentivisation – Jeremy Frey, Professor of Physical Chemistry, University of Southampton

IDMB aims and objectives: “to create a practical and attainable institutional framework for managing research data that facilitates ambitious national and international e-research practice”; and “to produce a framework for managing research data that encompasses a whole institution (exemplified by the University of Southampton)”.

Frey presented some of the IDMB findings (PDF). For example, in answer to the survey question ‘Who do you believe owns your research data?’, only 25% of respondents answered ‘School/University’; Frey believes most respondents simply weren’t aware, as the answer should have been the institution in most cases.

The University of Southampton’s data policy introduces no new legal or other principles; it is mainly about just applying existing policies for physical objects to the digital objects that have been generated.

Almost two-thirds of respondents answered that they were responsible for managing their data; most of it was stored on a ‘CD, DVD, USB, or external hard disk’, and a significant number of people didn’t know how much data they had. They reported that ‘reusing their own data was relatively easy’ – Frey disagreed; perhaps it was easy only compared to accessing other researchers’ data. One participant mentioned visiting museums to refresh their memory of objects, with varying degrees of success in locating them.

Frey suggested using the terms ‘context’ and ‘process’ rather than ‘metadata’.

There was a useful slide on data management costs.

During the Forum the group also considered research data as an object in its own right that may or may not be submitted to the REF 2014:

“In addition to printed academic work, research outputs may include, but are not limited to: new materials, devices, images, artefacts, products and buildings; confidential or technical reports; intellectual property, whether in patents or other forms; performances, exhibits or events; work published in non-print media. An underpinning principle of the REF is that all forms of research output will be assessed on a fair and equal basis. Subpanels will not regard any particular form of output as of greater or lesser quality than another per se.”

REF 02.2011 Assessment framework and guidance on submissions, July 2011 (PDF) (paragraph 106, p.22)


Day 1, Seventh RDMF, University of Warwick

Stock Photo: Golden Egg

The 7th Research Data Management Forum last week was really useful, so even though there are other blog posts available, it seemed necessary to add to them!

Links

Mark Thorley, NERC, suggested that datasets can be seen as ‘golden eggs’ with the citation of datasets as a reward for making them available. This could be supported if journals would only publish articles when they had a DOI for supporting data.

The resulting discussion brought up several questions, including ‘how do we define long term value of research data?’ NERC are working on a data value checklist to support their Data Policy:

“develop criteria to help identify data of long‐term value”

NERC Data Policy – Guidance Notes (PDF) Thorley, 2011 (p.5)

Other issues discussed included: ‘cost effectiveness’, meaning the time taken by researchers to complete a task as well as value for money; what we mean by ‘open’, i.e. open to everybody or only within the research community; and changing the way researchers regard data, which depends on the subject area.

Thorley talked about the need to work with each research data actor and understand the different roles. Publishers were not represented in the four actors diagram, and there was also discussion about whether the general public should have a role.

Simon Hodson, JISC MRD Programme Manager, suggested that some possible work would be to “break down, enunciate and list the various examples of ‘what’s in it for me’, for the various stakeholders”. This is an approach we have been taking with the stakeholders for Kaptur – we hope to have a better idea of ‘what’s in it for visual arts researchers?’ after the environmental assessment interviews are completed. We are also gathering examples from other sources, for example Laura Molloy’s presentation at a Kultivate workshop on ‘Archiving and Curation’, 23rd March 2011, about the UK performance practitioner survey (PDF).