Online survey

In our quest to better understand how our colleagues manage their research data, and how best to customise our Data Management System (based on DSpace) to suit their needs, we started out by organising face-to-face interviews (see here, here, and here). Unfortunately, interviews are very time-consuming, and conferences, lectures, holidays etc. make it very difficult to find suitable times for everyone to meet. For this reason, we decided to use an online survey, created on www.surveymonkey.com. We kept the survey short (it takes around 10 minutes to complete) to encourage people to respond.

The link to the private survey was sent out to the group's internal mailing list, and, so far, has been answered by 19 people. The online survey very closely resembles the pre-interview questionnaire we gave to our colleagues before meeting with them, so these people did not have to participate again. It comprises 10 questions, divided into 3 pages.

Page 1:

  1. Your name
  2. Your research role
  3. What are the main topics of your research?
  4. Do you have a particular strategy for storing and managing your research data? For example, where do you store your data? Are you creating regular backups?

Page 2:

If you have never produced/collected a data set for your research yourself, skip to the next page. With "data set" we refer to any collection of digital data that you might have used to run an experiment. The data set can be either already published or still on your hard disk.
If you did create data sets yourself, please list the three that best represent your research work, and answer the related questions in as much detail as possible.

  1. Data set 1:
    • Name
    • Data types/formats
    • How large is the data set?
    • What metadata is/can be attached?
    • Are data copyrighted?
    • Is the data set still being updated or is it in a final version?
    • Do you back up the data set? Where and how often?
    • What will happen to the data set once you leave C4DM?
    • Is the data set publicly available (i.e. available to researchers outside C4DM)?
    • If yes, please indicate at least one publication in which you used the data set.
    • If not, do you have any plans to publish it?
  2. Data set 2: ...
  3. Data set 3: ...

Page 3:

  1. Have you ever used other people's data sets in your research?
  2. If there were a centralised Research Data Management System at C4DM, would you use it? What features would the system need to have for you to be able to integrate it into your research work flow?
    • Upload data via web-interface
    • Command line tools for uploading and managing data and metadata
    • Access data as a virtual drive
    • Web-based advanced search functions (uses metadata)
    • Customisable metadata and free text descriptions
    • Download selected files in block as zip archives
    • Link data to software (e.g. SoundSoftware)
    • Versioning
    • RDF Support
    • DOIs Support
    • Possibility to create simple websites to expose datasets (e.g. related to a publication)
    • Other (please specify)
  3. Do you have any additional comments or suggestions?

Results

Data management and backup practices are very similar to those of the people we previously interviewed: data is usually kept on local hard disks and backed up more or less regularly in various ways. Some people like to use Dropbox or similar services to keep data synchronised across different machines.

About half of the respondents listed at least one data set, of which two are already published. Datasets include video, audio, and MIDI files; MATLAB data files; manual and automatic music annotations; text files containing notes, interview transcriptions, logs; RDF data dumps. Sizes range from a few KB up to several GB. A third of the respondents would be interested in publishing some of their data, after either finalising their work (and writing a paper about it), or cleaning up and repackaging the data.

The response to question 9 (features) gave these results:

  • Upload via web: 12
  • Virtual drive: 11
  • Zip download: 9
  • Command line tools: 9
  • Advanced web search: 8
  • Links to software: 8
  • Versioning: 8
  • Publication website: 7
  • Customisable metadata: 6
  • RDF support: 4
  • DOI support: 1
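
To put these counts in proportion, they can be expressed as a share of the respondents. The following is only a minimal sketch: the counts are copied from the list above, and the total of 19 is the number of respondents mentioned earlier, under the assumption (not verified) that all 19 answered this question. The function name `vote_shares` is made up for this illustration.

```python
# Counts copied from the list above.
feature_votes = {
    "Upload via web": 12,
    "Virtual drive": 11,
    "Zip download": 9,
    "Command line tools": 9,
    "Advanced web search": 8,
    "Links to software": 8,
    "Versioning": 8,
    "Publication website": 7,
    "Customisable metadata": 6,
    "RDF support": 4,
    "DOI support": 1,
}

# Assumption: every one of the 19 respondents answered this question.
RESPONDENTS = 19


def vote_shares(votes, total):
    """Return (feature, count, percent) tuples, highest count first.

    Percentages are rounded to the nearest whole number. Features with
    equal counts keep the order in which they appear in `votes`.
    """
    ranked = sorted(votes.items(), key=lambda item: -item[1])
    return [(name, count, round(100 * count / total)) for name, count in ranked]


for name, count, percent in vote_shares(feature_votes, RESPONDENTS):
    print(f"{name:<22} {count:>2}  {percent:>3}%")
```

If fewer people answered the question, the percentages would be correspondingly higher, so they should be read as lower bounds.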

Two observations, in my opinion, can be made from these results. First, the "virtual drive" access feature would be much appreciated, but we will have to wait until we get a working release of DataStage/DataFlow to test it. This means that, for the time being, our system will not be usable as a day-to-day "data dump", but only to store data sets in a more or less final, usable state. Figure 1 illustrates how the system should work (image from the poster presented at DMRN+6). The Data Management System Server (red circle) will in this case be the DSpace repository.

Figure 1: functional overview of the Data Management System to be implemented at C4DM.

Second, our colleagues do not seem very interested in DOIs (Digital Object Identifiers). The point of using DOIs or other persistent identification standards (e.g. www.handle.net) is to allow for easier citation of datasets, with the predicted benefit of increasing their number. We suspect that one of the reasons for such a low response is that not everyone is familiar with DOIs. Also, because academic performance is measured on the basis of paper citations, people prefer to refer to the paper describing the dataset, rather than to the dataset itself. This is an important aspect that needs to be addressed in a future phase, through training and promotion of data publication. On this point, see for example "Making the case for research data management" by the DCC ("...Datasets have yet to make a mark in research assessment terms compared with the traditional article. This is likely to change with evidence that making data related to an article publicly available correlates with higher citation rates, at least in fields that have built the necessary repositories, standards and collaborative culture.")

A few interesting comments also emerged. A question was raised regarding space and data transfer speed restrictions in the case of large files such as HD video. At the beginning of the project, we reduced our scope to exclude audio and video data from our prototype system, both for copyright and for space reasons. Nevertheless, audio and video data are an integral part of research in digital music, so we will have to take them into consideration as soon as possible after we release a first version of the data management system.

Another question that popped up between the lines regarded the definition of metadata (someone even explicitly wrote "I don't know exactly what you mean by metadata"). In fact, a lot of the data produced at C4DM is metadata in itself, in the form of information describing music (see for example the SPARQL Endpoints available on the OMRAS2 project's website). We will need to decide how to treat these data: either as actual metadata to be ingested using specified schemas, or as data, made available as serialised metadata files with only the standard descriptive metadata "properly" indexed in DSpace. Explanations of how to manage this kind of metadata will also be required in future documentation and training material. This material will also need to clearly define the different parts of the system, and in particular what will and will not be allowed on the internal repository as well as on the outward-facing repository (Public Datasets Repository in Figure 1). This information will need to be included in the Data Management Policies document that we are currently drafting.