The DCC Roadshow in Cambridge, Day One
Posted: November 14, 2011
The following blog post has been written by Tahani Nadim, Kaptur Project Officer, Goldsmiths, University of London.
The sixth DCC Roadshow on data management, organized in conjunction with Cambridge University Library, began with DCC’s own Associate Director, Graham Pryor, highlighting the current big theme summarized as the “3 Rs”: re-use, regeneration and repurposing of data. His talk focused on the scale and complexity of data generation across all the sciences though, once more, the “hard” sciences received most attention, with examples like the Large Hadron Collider (15 petabytes of data annually) and GenBank, the NCBI’s nucleotide sequence database (holding approx. 130 billion bases in 140 million sequence records in the traditional GenBank divisions). Nathan Cunningham, of the British Antarctic Survey’s (BAS) Polar Data Centre, gave some very dazzling and dizzying examples of the range and complexity of data produced by the BAS – “data bling” and “Disney science”, as he called it. Some of the challenges faced by Cunningham and colleagues relate to turning unstructured data into structured data; describing data in such a way as to make it discoverable and usable; and, importantly, finding ways to automate this.
For Cunningham, so-called data “mash-ups” (combining data on, e.g., sea surface temperature, feeding routes of penguins, chlorophyll levels or high-resolution sea ice images) provide decision-making tools as well as diagnostic tools. David Shotton, a cell biologist turned bioinformatics guru, made very similar arguments for the biosciences. Introducing a host of data curation projects, particularly those focused on digital imaging, Shotton pointed to reasons why many researchers still do not publish their data: information and work overload; pressure for financial viability (to get money for their departments); cognitive overheads and skills barriers. The latter was also very clear from Cunningham’s presentation: data curation requires specialised knowledge of the data-generating discipline and often cannot be ‘delegated’.
The presentations by Pryor, Cunningham and Shotton left little doubt that data sets are becoming the new instruments of science and are establishing new ways of working (e.g. collaborative modelling in a global virtual laboratory, as done in the neurosciences in the CARMEN project), but this poses a number of critical questions for researchers and institutions alike: Who will analyse all this data, and how? Are digital data the new special collections? Regarding regulation, Pryor noted that in some cases, for example European IP laws, regulation actively obstructs data sharing as well as digital preservation. Pryor also voiced concerns about the handling of data management requirements in the research councils’ policies, pointing in particular at the EPSRC’s timescale and vague language.
In terms of providing access to this data, Pryor introduced some commendable initiatives such as the Panton Principles, as well as open science applications such as the Citizen Science Alliance. Again, open data throws up a lot of questions: How to be “open”, but also how far to go with being “open”? What are the incentives for being “open”? How to handle sensitive data (particularly in the biomedical sciences)? One study on the current handling of research data mentioned by Pryor, the Incremental project, was later described in more detail by Elin Stangeland of University of Cambridge’s DSpace repository. A JISC-funded collaboration between Cambridge and the University of Glasgow, the project produced a scoping study before drawing together guidance and support literature, providing training in data curation and creating audiovisual learning resources.
A different perspective was offered by Dr Anne Alexander. Actually, a doubly different perspective, since this presentation came from a researcher in the humanities. Alexander’s research focuses on Middle Eastern politics, particularly labour movements and similar political movements in the region. Her current project, which looks at the Egyptian revolution, demonstrates the dramatic transformation in the data resources she engages with. She commenced her presentation with an image of her usual data, such as notes, newsletters, newspapers and analogue tapes; the remainder of her talk was accompanied by Facebook pages, Twitter feeds, YouTube videos and other social media content. Alexander argued that the political landscape has radically taken to the novel spaces offered by social media: the strike committee of sugar refinery workers in Egypt, the strike committee of doctors in Egypt, as well as the ruling military council, all have Facebook pages which are actively enrolled in their respective political practices.
The problems faced by the researcher are plentiful: How to capture (save, store, make discoverable, etc.) not just the discrete data entity (the tweet, the video, the picture, the status update, etc.) but also the context, that is, the comments, the other “recommended” or “related” content, and other dynamically created relations and objects? Another issue pertains to the difference between public and published: pulling comments made by activists against authorities out of the digital realm (e.g. a Facebook wall) and committing them to paper and/or circulating them by other means and routes poses serious ethical questions. Equally confounding is the problem of “ownership” raised in the discussion: if everything is owned by Facebook, what is a researcher to do?
In conclusion, Alexander suggested that it is not helpful to think of the Internet as an infinite archive; this gives us a false sense of security. Instead, researchers need to acquire archival skills of their own.