A bottom-up approach to research data management

During the past five and a half months we went from being completely unaware of data management and data curation to having a draft data management policies document and a test data repository for the Centre for Digital Music. While participating in the programme launch at the beginning of December, I had the impression that our project was a bit of an outsider compared to other teams, which clearly had much longer experience in the field. This, I think, allowed us to take a slightly different point of view on the subject.

So what is different about our project? There are a few reasons I can think of:

  • Ours was a project run by researchers in digital music, for researchers in digital music, in a bottom-up approach. Institutions (as is happening now at QMUL as well) usually define generic policies and build generic data management systems intended to cover the entire university. These projects are run by IT services and libraries, the natural experts in information management. Of course, direct engagement with research groups in different areas is essential, as pointed out for example in the results of the MaDAM project. Our advantage is that the people who work on the project and develop the system are the very same people who will use it. Working in the same office as the users is another great advantage, because a discussion on the topic can start at any time, for example during lunch or coffee breaks.
  • Researchers at C4DM are mainly electronic engineers or computer scientists with very strong IT and programming skills. Being software developers themselves, they have strong opinions on technical solutions, for example on how a repository should work, which makes deploying such solutions particularly difficult.
  • Research data in digital music is very heterogeneous, and changes very quickly. This makes it particularly hard to categorise, store, and curate using generic solutions (see also the post about metadata).
  • The group is relatively small, so it is not feasible to have a dedicated person who takes care of data publication and curation. As a consequence, the data creator/scientist (i.e. the researcher) is usually also the data curator/manager, in a sort of "self-service" data management system. This differs from more complex situations, for example at the institutional level, where the different roles are covered by specialised staff (see for example Pryor, G. and Donnelly, M. (2009). Skilling up to do data: whose role, whose responsibility, whose career? The International Journal of Digital Curation, 4(2), pp. 158–170).

A lot of information on data curation is available through resources such as the DCC and JISC, but it is very much focused on institutions rather than research groups. Of course, this is the more obvious way to approach the problem: institutions are increasingly required to have data management policies in place, and they have the resources and clout to make things happen. Many of the concepts apply to small and large groups alike, but some tasks, such as writing policies, require more thought at the group level.

Building a data management system using available solutions has also been (and still is) a complex problem. Our impression is that the most established software packages were initially created for librarians, with publications in mind. Storing research data involves more complex data and metadata structures, which often exposes the limitations of the existing systems. For example, it is very hard to fit the multi-layered metadata available for digital music data into DSpace (see also this post). Taking DSpace again as an example, the other limitation of these systems is the clear need for administrators (data librarians and IT) who manage the workflows, import bulk data with administrative tools, control access policies, etc. This does not integrate well with day-to-day research workflows in our group, and can partly discourage people from using the system. In this respect, we have been planning to work on a few solutions (see this post), and we really appreciate the work carried out by the DataFlow and DepositMO projects to integrate data management into the research workflow.
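To make the metadata point a bit more concrete, here is a toy sketch (our own illustration, not the DSpace API and not a real dataset): digital music data tends to carry metadata at several layers at once, namely collection, recording, and derived annotation, while a publication-oriented repository typically stores flat field/value pairs, so the layering has to be squashed and is effectively lost.

```python
# Illustrative only: a hypothetical nested metadata record for one
# audio collection, with metadata at three layers.
dataset = {
    "title": "Example beat-tracking dataset",        # collection level
    "recordings": [
        {
            "file": "track01.wav",                   # item level
            "sample_rate": 44100,
            "annotations": [                         # derived-data level
                {"type": "beats",
                 "tool": "hypothetical-tracker",     # made-up tool name
                 "times_sec": [0.5, 1.0, 1.5]},
            ],
        },
    ],
}

def flatten(record, prefix=""):
    """Flatten nested metadata into the kind of flat field/value pairs
    a publication-oriented repository stores; the layering survives
    only as naming conventions baked into the field names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{name}[{i}]."))
                else:
                    flat[f"{name}[{i}]"] = item
        else:
            flat[name] = value
    return flat

for field, value in flatten(dataset).items():
    print(field, "=", value)
```

Nothing in the flat form tells the repository that `annotations` depend on a particular recording, or that the beat times were produced by a tool rather than a person; that is the structure we struggle to express in systems designed around flat publication metadata.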

These are the opinions of the SMDMRD team, but we would really like to hear comments from other projects: do you agree with us? Do you feel you are in the same situation?