And the winning platform is...
This is a follow-up to one of our first blog posts, about the pros and cons of several data management systems. Before Christmas, Steve worked hard to complete the installation of the software packages we wanted to test: Fedora Commons with the Islandora Drupal module, DataVerse, and DSpace. We were also hoping to obtain a demo version of DataFlow to test alongside the other packages: we met the DataFlow team in Nottingham and they were planning to release some Debian packages by the end of the year, but that turned out to be more difficult than expected. The team in Oxford is now planning to send us a virtual machine image by the end of January.
Because our project is only six months long, it is now time to decide which platform we will use for our prototype Data Management System. To make the decision, we used a number of criteria:
- how many of the user requirements the platform meets out of the box
- how easy it is to install, run, and maintain
- how easy it is to customise (appearance, metadata management)
- how many standards it supports
- how well developed, supported, and widely used it is
Let's start with the user requirements. These were collected from the first round of interviews (posts one and two) and may not be complete, but they give a rough idea of what our colleagues would like to use. We are currently running an online survey (accessible only to people at C4DM) to collect more information. The following table summarises the main user requirements:
| Requirement | Fedora | DSpace | Dataverse | DataFlow |
|---|---|---|---|---|
| Seamless access (virtual disk) and/or command-line access with batch import/export support | Batch ingestion | Batch ingestion with templates | No | Virtual disk (DataStage) |
| Easy-to-use web interface for searching published datasets | Via Drupal (Islandora) | Two available (Manakin is the best looking) | Yes, very polished | Basic |
| Advanced metadata-based search functionalities | Yes | Yes | Yes | Yes |
| Customisable metadata and RDF support | Highly customisable + RDF | Customisable, no RDF | Limited, no RDF | Yes, don't know about RDF |
| Version control for datasets | Yes | Not yet | Yes | Yes |
| Multi-level access policies | Several | Yes | Yes | Yes |
| Link data with publications (DOI, handle.net) | handle.net | handle.net | handle.net | DOI (DataCite) |
On paper, DataFlow is the winner: it meets (almost) all our requirements, chiefly thanks to DataStage, something the other platforms don't offer. DataStage would be particularly appreciated because it would make integrating the system into the research workflow far less disruptive. The other three packages are quite similar, all being repository platforms. Dataverse is the least feature-rich, but has a very polished interface. DSpace and Fedora are very close, with the latter supporting versioning. DSpace's Manakin GUI looks very nice, while for Fedora everything depends on how the Islandora module for Drupal is customised.
This brings us to the second and third criteria: installation and maintenance, and customisation. The easiest platform to install and run was definitely DSpace, thanks in part to the vast amount of help available online. Dataverse was also relatively easy to install by following the documentation. After installing Fedora, first Ivan and then Steve had serious problems integrating it with Drupal because of the complex access policies implemented in the software. In fact, we gave up on Fedora because we couldn't make it work properly. Clearly, we can't say anything about DataFlow, since we haven't received it yet; assuming we get a VMware virtual machine image, there shouldn't be any problem getting it running.
For the same reason, we have no idea how easy it is to customise DataFlow. The only thing we know is that it is written in Python, while all the other platforms are written in Java. We want a customisable platform because the data produced at C4DM is very heterogeneous, including audio and video files, manual annotations of recordings, MIDI files and automatic score transcriptions, auditory signals, Matlab data files, and more. This means we might need to create (or adapt) metadata schemas to accommodate all the different data types. Nevertheless, we will also have to strike a balance between the number and detail of metadata schemas, the time people can spend filling in this information, and the sustainability of such an approach (i.e. we don't know whether there will be someone available in the future to create new schemas as new data types arise).
In order of extensibility, Fedora comes first, followed by DSpace; Dataverse seems a rather closed platform. Finally, since there probably won't be many resources available to maintain the system, we want to reduce the need for technical support to a minimum. From this point of view, DataVerse and DSpace seem the best options, while Fedora (because of its complexity) and DataFlow (because it is still under development) might require more continuous care.
Another important aspect to consider is the management of metadata: we are looking for a platform that allows us to create and customise metadata schemas, because we have several data formats we want to ingest, and metadata are an essential part of the work of creating a data set. It is worth pointing out that there is an ongoing project at C4DM (see the Isophonics website) to define specific ontologies for describing multimedia and music objects (e.g. the Music Ontology, audio features ontologies, studio ontologies...). The Resource Description Framework (RDF) is also used extensively. We would thus like to incorporate these two aspects into our system. DSpace and especially Fedora are very customisable, while DataVerse is not. As for DataFlow, we don't really know, but we assume it will be customisable as well.
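To give an idea of what this kind of metadata looks like, here is a minimal sketch of a Dublin Core record for a hypothetical C4DM dataset, built with Python's standard library. The element names and namespace URI are the real Dublin Core ones; the dataset values themselves are invented for illustration.

```python
# Sketch: a minimal Dublin Core record for a hypothetical dataset.
# The "dc" namespace is the standard Dublin Core Elements namespace;
# the field values below are made up for this example.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for term, value in [
    ("title", "Guitar recordings with manual chord annotations"),
    ("creator", "C4DM"),
    ("type", "Dataset"),
    ("format", "audio/wav"),
]:
    # Qualified tag names use ElementTree's {namespace}tag convention.
    el = ET.SubElement(record, f"{{{DC}}}{term}")
    el.text = value

xml_out = ET.tostring(record, encoding="unicode")
print(xml_out)
```

A platform with customisable schemas would let us extend such a record with music-specific fields (say, instrumentation or annotation type) drawn from the Isophonics ontologies.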
As for standards, all platforms support Dublin Core metadata and OAI-PMH (metadata harvesting). RDF is built into DSpace and Fedora. Fedora also has the widest support for authentication methods, although LDAP support is enough for us to use our authentication server (Fedora, DSpace and DataFlow all have it). What we really liked about Fedora, DSpace and DataFlow is their support for SWORD, a protocol for exchanging data between repositories.
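OAI-PMH harvesting is just parameterised HTTP requests, which is part of why every platform supports it. The sketch below builds a standard ListRecords request with the Python standard library; the base URL is a placeholder, not a real endpoint, but the `verb` and `metadataPrefix` parameters are the ones defined by the protocol.

```python
# Sketch: constructing an OAI-PMH ListRecords request.
# The repository base URL is a placeholder; "verb" and
# "metadataPrefix" are standard OAI-PMH request parameters.
from urllib.parse import urlencode

base_url = "https://repository.example.org/oai/request"  # placeholder endpoint
params = {
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",  # unqualified Dublin Core, the mandatory minimum
}
request_url = f"{base_url}?{urlencode(params)}"
print(request_url)
```

A harvester would fetch this URL and receive an XML list of Dublin Core records, which is how aggregators discover what a repository holds.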
Finally, an important aspect to consider when choosing a software solution is how widely used it is, and how far along it is in its development life cycle. Here, DataFlow is the most recent arrival. The other platforms are all at an advanced stage, but the most widely used is by far DSpace (for example, see here for a chart of the usage of open-access repository software registered on OpenDOAR).
And the winner is...
Given all the above considerations, we selected DSpace for our pilot data management system. Overall, Fedora seems a bit too complex and would require too much work on our part to deploy a working version within our short time frame. The fact that it is quite hard to set up is a further indication of its complexity, which would probably demand more active support and maintenance. Dataverse, on the contrary, is too limited in its functionality, especially regarding metadata customisation. What we liked about DSpace is the possibility of creating new modules, for example for automatically extracting metadata from multimedia file headers. Steve has already started working on this; more about it in a future post. Had DataFlow come out just a few months earlier, it would probably have been our first choice, but for now we prefer the safest option.
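To illustrate the kind of header extraction such a module could do, here is a small self-contained Python sketch that reads technical metadata from a WAV file header (a real DSpace module would be written in Java, so this is only a demonstration of the idea). It creates a one-second silent WAV in memory and then reads its properties back, instead of asking a depositor to type them in.

```python
# Sketch: extracting technical metadata from a WAV header using
# Python's standard wave module. The file is generated in memory
# so the example is self-contained.
import io
import wave

# Create a one-second, 16-bit, stereo, 44.1 kHz silent WAV in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)       # stereo
    w.setsampwidth(2)       # 2 bytes per sample = 16-bit
    w.setframerate(44100)   # CD-quality sample rate
    w.writeframes(b"\x00\x00" * 2 * 44100)  # 44100 frames x 2 channels

# Read the metadata straight back from the header.
buf.seek(0)
with wave.open(buf, "rb") as w:
    metadata = {
        "channels": w.getnchannels(),
        "bit_depth": w.getsampwidth() * 8,
        "sample_rate_hz": w.getframerate(),
        "duration_s": w.getnframes() / w.getframerate(),
    }
print(metadata)
```

Fields like these map naturally onto `dc.format`-style metadata, sparing users one of the more tedious parts of ingestion.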
Nevertheless, we will test DataFlow when we get the chance. Our hope is to eventually combine DataStage with DSpace, using the SWORD protocol to transfer data sets. This would give us the flexibility of a virtual hard drive for everyday use (DataStage) together with the stability of a DSpace repository. Furthermore, as previously mentioned, Queen Mary University already has a DSpace institutional repository in place, albeit not widely used (though this might change in the future), which is another strong reason for choosing DSpace. These are, of course, our specific considerations, based on our requirements and time constraints.