With thanks to Carlos Silva, KAPTUR Technical Manager, for the following blog post.
On 18th February I attended a workshop led by the JISC-funded Orbital project to gather information about the open source software CKAN and how it could be used to support research data management in the academic sector.
The workshop started with a presentation from Mark Wainwright (community co-ordinator for the Open Knowledge Foundation) on the latest release of CKAN, its origins and potential in the academic community.
One of the big advantages of using CKAN is that the ‘core’ system is surrounded by APIs, making it flexible enough to accommodate different user and institutional needs. This means that the core software can be updated without affecting the APIs or having to adapt external code to fit with the core software.
Another promising feature is the ability of CKAN not only to harvest other CKAN databases, but also to search other types of repositories such as EPrints and DSpace. The mechanism developed covers a range of repository sources: not only EPrints and DSpace, but also geospatial servers, web catalogues and other HTML index pages.
In terms of sustainability, CKAN has been developed over the last 6 years, so it is now relatively mature, with an extensive and very streamlined workflow for adding features, fixing bugs and enhancing the core services. The latest version 2.0 (recently released as Beta) promises to be an exciting release with more visually enhanced tools, an improved groups feature, customisable metadata and a rich search experience based on Apache Solr.
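As a rough illustration of what working through the APIs rather than the core looks like, the sketch below builds a dataset search request for CKAN's Action API and reads titles out of a response of the shape that API returns. The instance URL `https://ckan.example.org` and the sample response are made up for illustration, and the `/api/3/action/package_search` path is the one used by CKAN 2.x; older instances may expose a different path.

```python
import json
from typing import List
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a URL for the package_search action of CKAN's Action API."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> List[str]:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Hypothetical instance and hand-written sample response, for illustration only.
url = package_search_url("https://ckan.example.org/", "visual arts")
sample = (
    '{"success": true, "result": {"count": 1, "results": '
    '[{"name": "sketchbooks", "title": "Sketchbook scans"}]}}'
)
print(url)
print(dataset_titles(sample))
```

Because external code only depends on the request and response format, it should keep working when the core is upgraded, provided the API itself stays stable.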
The workshop continued with a presentation from the data.bris project at the University of Bristol. It is amazing to note that each Principal Investigator can apply for up to 5TB of storage for free, backed up securely for 20 years!
Academics receive a mapped network drive which they can access and use to deposit content; however, this requires additional features to manage research data. Therefore, the data.bris project was interested in CKAN due to its flexibility, data access (ability to have private datasets), organisation schema, ability to share with external researchers and the CKAN search engine.
In the future, the University of Bristol is considering two instances of CKAN, one for a public read-only catalogue of research data publications and another for controlled access (which would include teaching and other types of data).
The third presentation was from Orbital; Project Manager Joss Winn provided a virtual tour of the latest tools developed by the project. They have connected CKAN to several other systems: their EPrints repository and also different departmental databases, such as an awards management system.
The Orbital set-up allows their researchers to have different types of data located in a central place; this includes the policies, profiles, publications and analytics information from specific outputs, making the most of the CKAN software.
The demonstration included mention of the software created to enable deposit of data from CKAN to their EPrints repository, something which we have been anticipating for the last few months and which is an exciting development for the sector. Orbital have released the code through GitHub, which in theory should work with CKAN version 1.7. The functionality enables CKAN to submit the metadata to EPrints using the SWORD2 protocol, but not the actual files themselves; instead a link is added to EPrints which points back to the files deposited in CKAN.
The Orbital team are proposing a two-year roadmap to their senior management team to take responsibility for this project, carry it forward and embed it further into the University of Lincoln’s infrastructure.
During the group discussion session, workshop participants suggested a comprehensive list of about 80 tools, features, amendments and requests that we would like to see as part of a new version of CKAN (a Google Docs spreadsheet is available: http://lncn.eu/mxz2). Again in groups, we did a gap analysis for the specific items requested, and a CKAN expert was available to answer any questions.
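To make the metadata-only deposit described above more concrete, here is a minimal sketch of the kind of Atom entry a SWORD2 client could send: descriptive metadata plus a link pointing back to where the files live. This is not Orbital's code; the function name, the choice of fields, the `related` link relation and the example URL are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Headers a SWORD2 client sends when the request body is an Atom entry.
ENTRY_HEADERS = {
    "Content-Type": "application/atom+xml;type=entry",
    "In-Progress": "false",
}


def build_metadata_entry(title: str, creator: str, dataset_url: str) -> bytes:
    """Return an Atom entry describing a dataset, with a link back to its files."""
    ET.register_namespace("atom", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)

    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    # Only a pointer to the data is sent; the files themselves stay in CKAN.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", {"rel": "related", "href": dataset_url})
    return ET.tostring(entry, encoding="utf-8", xml_declaration=True)


# Hypothetical dataset details, for illustration only.
entry_xml = build_metadata_entry(
    "Sketchbook scans",
    "A. Researcher",
    "https://ckan.example.org/dataset/sketchbooks",
)
print(entry_xml.decode("utf-8"))
```

A client would POST `entry_xml` with `ENTRY_HEADERS` to the repository's SWORD2 collection URL; which metadata fields the repository actually maps into its own records depends on its configuration.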
As an academic community we found that there were lots of similar challenges which should be easier to address collaboratively.
From the visual arts community perspective, although CKAN can’t currently address all the requirements from our user requirements list (PDF), there is scope for further development and this is continuing in the right direction.
With thanks to Carlos Silva, KAPTUR Technical Manager, for the following blog post.
The Digital Curation Centre’s (DCC) Research Data Management Forum was held at Madingley Hall, Cambridge from 14th to 15th November 2012; presentations from the event are available online.
“Technology aspirations for research data management”
The take-home message for the day was that IT will need to be more involved with research, and that this collaboration will have an impact on future grants, projects and sustainability.
Jonathan Tedds presented lessons learned from the University of Leicester via projects such as the UK Research Data Service (UKRDS) pathfinder study and Halogen, as well as from other projects such as Orbital. Jonathan covered ‘top-tips’ to get researchers’ attention and how to develop software as a service through the BRISSkit project (Biomedical Research Infrastructure Software Service kit).
Steve Hitchcock covered lessons learned from DataPool on building RDM repositories. The project was specifically to do with SharePoint and EPrints; however, KAPTUR did get a mention as an example of other projects using EPrints and not re-inventing the wheel. An application called Data Core was published in the EPrints Bazaar in July 2012; it:
“Changes the core metadata and workflow of EPrints to make it more focused for as a dataset repository. The workflow is trimmed for simplicity. The review buffer is removed to give users better control of their data.”
Paul O’Shaughnessy from Queen Mary, University of London, spoke about how their IT services are changing and how different parts of the university needed to be involved in making this happen. The University currently has around 16,000 students; they started an IT transformation programme because their original set-up was not fit for purpose (for example, there were 7 different email systems). After creating a strategic plan for the next 5 years they realised that a third of their funding income comes from research grants, so investing in IT infrastructure to support this was crucial. They were investing 3 to 4%, whereas other Russell Group universities tend to invest 5 to 10%. They followed a greenfield approach and mentioned the importance of letting staff know that it was not just IT who would need to be involved and that this was not just another project. An interesting number was that 25% of HSS grant applications were lost because of poor IT sections.
The aim of the Janet brokerage services is to become a community cloud of available resources, by:
- developing frameworks and procurement structures such as DPS to facilitate access to services
- working with DCC and JISC to ensure sensible requirements and priorities
- hoping to get to a conclusion early next year about these services (Janet is currently in talks with Google AWS, Dropbox and Microsoft Azure will probably follow)
There was a comment about limitations with Dropbox, but also about the possibility that universities may be able to use it in the future if the current issues of storing research data outside the EU can be overcome.
Other topics and interesting points from the discussion:
- Suggestion that just as there are Faculty Librarians, we should have Faculty IT people.
- Recommendation to negotiate resources with IT; for example, if there is someone with the skills, try not to use that person to fix printers but for something more productive.
- A Russell Group University mentioned that 1TB of data stored over 30 years will cost close to £25,000.
Break-out session on the Engineering and Physical Sciences Research Council (EPSRC)
There was discussion about the research data that they expect projects to make available. They mentioned the importance of joining and gathering together all metadata; of bringing IT together; and of a drip feed of information (for example through OAI, SWORD and other protocols that transfer information and allow metadata to be harvested).
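One way to picture the "drip feed" is incremental harvesting over OAI-PMH, where a harvester asks a repository only for records changed since a given date. The sketch below builds such a request and pulls titles out of a response; the repository URL and the sample response are invented for illustration, and a real harvester would also need to handle resumption tokens and errors.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional
from urllib.parse import urlencode

DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, since: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request for Dublin Core metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if since:
        # 'from' limits the harvest to records changed on or after this date.
        params["from"] = since
    return f"{base_url}?{urlencode(params)}"


def harvested_titles(response_xml: str) -> List[str]:
    """Collect every Dublin Core title found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter(f"{{{DC_NS}}}title")]


# Hypothetical endpoint and hand-written sample response, for illustration only.
request_url = list_records_url("https://repo.example.org/oai", since="2012-11-01")
sample_response = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sketchbook scans</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()
print(request_url)
print(harvested_titles(sample_response))
```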
Overall it was a good workshop which provided different points of view but at the same time made me realise that all the institutions are facing similar issues. IT departments will need to work more closely with other departments, and in particular the Library and Research Office in order to secure funding and make sustainable decisions about software.
Finally, a ‘flexible’ yet intelligent approach should be taken by IT; for example, PRINCE2 methods do not fit research projects, as these all change over their duration. The Agile methodology should be used, and IT should be expected to be involved in it and knowledgeable about it.
With thanks to Carlos Silva, Technical Manager, for the following blog post:
The KAPTUR Technical Analysis report (PDF) recommended the piloting and further investigation of two different systems: DataStage to EPrints; and Figshare to EPrints.
Figshare to EPrints
Some of the advantages of integrating Figshare with EPrints are:
- The Figshare team is currently working on a desktop uploading tool to allow users a streamlined process of submission.
- Feedback from the Steering Group was that the user interface of Figshare was attractive and clear; it is already being used by researchers to store and manage research data and therefore the integration with EPrints would enable many institutions (as EPrints is the major repository platform in the UK) to encourage researchers to better manage their research data and then upload selectively to an institutional repository for publication.
Following telephone and Skype chats with the Figshare team a requirements document was created and shared with project partners and Simon Hodson. The idea was to create an API which would be free for use by any institution who wanted to link Figshare with an EPrints repository using the SWORD 2 protocol. Additional features included the development of the desktop uploader; a custom user interface design; back-end application development; and custom user accounts for the KAPTUR project partners to test the system.
Negotiations are still in progress. Further thought has been given to the infrastructure and pricing models, which will eventually have an impact when adopting a commercial approach with technologies such as Figshare and which, if not considered, could lead to an unsustainable solution for the sector.
DataStage to EPrints
The second pilot recommended by the report was to link DataStage (from the JISC-funded DataFlow project*) with EPrints. The technical implementation of this pilot started in June 2012, when the Technical Manager set up DataStage and DataBank on a local machine, demonstrated this to the Project Officers (in June) and the Steering Group (in July), and started collecting feedback on it. After testing the DataFlow software internally, the team started to explore the best way of linking DataStage with EPrints directly.
The advantages of integrating DataStage with EPrints are:
- DataStage can be institutionally based, which offers the potential for tighter control.
- It provides a structured metadata collection interface.
- It also provides flexibility when uploading, for example with the integration of a shared drive which uses a popular storage approach similar to Dropbox but with the advantage that the data is held on the institution’s servers.
The Technical Manager, through VADS’ host institution, the University for the Creative Arts, set up a test environment for the KAPTUR project (http://kaptur.ucreative.ac.uk). Test accounts have been given to project partners and an online feedback form has been set up to capture their feedback.
To test the DataStage connection with EPrints, a test repository with the latest EPrints version (3.3.10) was needed in order to use the SWORD 2 protocol; this was created (http://kaptur_repo.ucreative.ac.uk).
Both systems have been tested separately, and both systems have performed well.
The DataStage software should allow users to submit entire folders as ‘packages’ to a repository using the SWORD2 protocol; however, there is currently an issue** with the default version of DataStage, and no transfers can be made to any repository other than DataBank (the DataFlow project’s repository).
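For readers unfamiliar with what submitting a folder as a ‘package’ involves, the sketch below zips a folder and prepares the HTTP headers that a SWORD2 binary deposit would carry. It is not DataStage's implementation: the function names, the sample folder and the choice of the SimpleZip packaging format are assumptions for illustration only.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# Packaging identifier defined by SWORD2 for a plain zip of files.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def zip_folder(folder: Path) -> bytes:
    """Pack every file under `folder` into an in-memory zip, keeping relative paths."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(folder).as_posix())
    return buffer.getvalue()


def deposit_headers(filename: str) -> dict:
    """Headers a SWORD2 client sends alongside a zipped package."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": SIMPLE_ZIP,
        "In-Progress": "false",
    }


# Build a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "images").mkdir()
    (folder / "images" / "scan01.txt").write_text("placeholder scan")
    (folder / "readme.txt").write_text("project notes")
    package = zip_folder(folder)

packaged_names = zipfile.ZipFile(io.BytesIO(package)).namelist()
headers = deposit_headers("sketchbooks.zip")
print(packaged_names)
print(headers["Packaging"])
```

The receiving repository has to recognise the packaging format named in the `Packaging` header, which is one place where two systems that both "support SWORD2" can still fail to interoperate.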
As well as contacting DataFlow and EPrints, the Technical Manager has been in contact with various colleagues across the sector, from the Centre for Digital Music at Queen Mary, University of London (see blog post about connecting DataStage with DSpace) to other colleagues who have also looked into connecting DataStage with EPrints such as the UK Data Archive, University of Essex and the RoaDMaP project, University of Leeds.
At this point there are the following conclusions:
- EPrints 3.3 is required in order to have SWORD 2 fully enabled [completed].
- EPrints have tested the SWORD 2 protocol successfully with other EPrints repositories; however, connectivity with other types of repositories hasn’t been tested by EPrints yet.
- The DataFlow project manager replied saying that there were issues with the SWORD submission on the DataStage side; however, they were expecting to come up with a workaround for their V 1.0 release [It is noted that Richard Jones will be presenting about DataFlow at the JISCMRD Nottingham programme event so this is hopeful!!].
- The lead DataStage developer mentioned that SWORD2 was envisioned to work fully with DataStage and EPrints when it becomes available, and that previous versions of DataStage managed to work okay with EPrints; however, due to new developments and enhancements at either end, some changes on the DataStage side need to happen before it fully complies and can connect with EPrints.
*DataFlow was funded by JISC, under the University Modernisation Fund, from June 2011 – May 2012 to further develop a prototype out of the JISC-funded ADMIRAL project (2009-11).
Key points from the meeting:
- It was noted that there was diversity among the four institutions in terms of drafting the RDM policies – we can still collaborate and learn from each other – but the approach is necessarily different at each institution.
- University of the Arts London are really benefiting from their participation in the DCC University Engagement programme; the UAL Project Officer is working an extra day per week on this and as a result has been able to revisit and extend the KAPTUR Environmental Assessment through 20 x 5 minute telephone calls which will be followed up with 1 hour in-depth interviews with visual arts researchers.
- There was discussion about a definition for visual arts research data and how this might be constraining, but was needed at the same time in order to be able to move forward with the RDM policies. A working definition was presented to the KAPTUR Steering Group 3 months ago in response to questions raised by the UAL working group: http://www.slideshare.net/kaptur_mrd/kaptur-news06
- Feedback on training/support and the KAPTUR toolkits: recommendation to create KAPTUR videos about visual arts research data instead of hosting workshops at each institution (we already had plans to re-use content from the previous JISCMRD programme e.g. http://www.youtube.com/user/GUdatamanagement). I still think the face-to-face aspect of the workshops would be useful, but maybe there is a way to incorporate shorter sessions and use the videos as part of these? We will discuss at our next project team meeting in September.
- The Steering Group liked the Figshare interface and thought it would be appealing to visual arts researchers as well as easy to use; there were lots of questions about both DataStage and Figshare.
- Feedback on Sustainability: recommendation to get an idea of costs of the proposed technical infrastructure to include estimates of staff time required for ongoing support of the systems.
The presentations are available from SlideShare.
It was great to welcome Laura Molloy, Researcher at the Humanities Advanced Technology and Information Institute (HATII), to the Steering Group meeting. After the meeting Leigh, Laura and I met to discuss the project from the perspective of her role as JISCMRD Evidence Gatherer. As well as discussing impact and gathering evidence about benefits, Laura also came up with the concept of the chariot (KAPTUR project) being pulled by four horses (our four institutions). I really liked this idea of the race and also the need for collaboration to be well-matched in order to make the project successful.
Since the beginning of the KAPTUR project, the Technical Manager has maintained contact with the UCA IT department to ensure they are aware of the project and its requirements. Work requests to IT have established precise deadlines; however, for the purposes of this blog post the following tasks are represented in a month-by-month format for easy viewing:
|KAPTUR Technical Manager||Download and install DataFlow on a local environment (not server).||X|
|KAPTUR Technical Manager||Once DataFlow is stable, setup EPrints in the same local environment.||X|
|KAPTUR Technical Manager||Development work to link DataFlow with EPrints.||X|
|KAPTUR project team and partner institutions||First round of feedback and tests.||X|
|UCA IT Department||Create Virtual Machine (VM) on servers hosted by UCA.||X|
|UCA IT Department||Assign the URL kaptur.ucreative.ac.uk to the VM.||X|
|UCA IT Department||Create back-up mechanisms in line with usual procedures for the VM.||X|
|KAPTUR Technical Manager||Transfer local development environment to the new VM.||X|
|KAPTUR Technical Manager||Test and debug.||X|
|KAPTUR project team and partner institutions||Further feedback and tests of the system via the VM.||X|
|KAPTUR Technical Manager||Further tests and debugging leading to initial pilot system.||X|
The following blog post is adapted from the Conclusion and Recommendations section of the Technical Analysis report (PDF):
The KAPTUR Technical Manager investigated 17 different types of software which were compared to the requirements of the four partner institutions (details and appendices in the report). The next stage of the research reduced the choice of software to five options: DataFlow, DSpace, EPrints, Fedora, Figshare. These were all found to be suitable for managing research data in the visual arts; through a further selection process EPrints, Figshare, and DataFlow were identified as the strongest contenders.
[…] it is recommended that two pilots occur side by side: an integration of EPrints with Figshare and a separate piece of work linking DataFlow’s DataStage with EPrints. By integrating EPrints with Figshare, the project can take advantage of a system which has been built with, and for, researchers to handle research data specifically, and has a user-friendly visual interface (which is constantly evolving and enhanced by Figshare directly). […] By integrating DataStage with EPrints the research data storage and software will be hosted within each institution, providing them with better control over the type of data that can be stored, published and managed. The integration will also enable content uploaded in DataStage to be securely backed up by the institution and accessible from anywhere in the world. A ‘Dropbox’-like tool is featured in the latest beta version, providing a user-friendly interface which will benefit visual arts researchers. EPrints will effectively provide the role of DataFlow’s DataBank.