On 8th October, we gave our second conference tutorial on software and data for reproducible research. This time the conference was the International Symposium On Music Information Retrieval (ISMIR).

Music Information Retrieval (MIR) largely considers two distinct areas: finding useful information from audio (e.g. tempo, beats, pitch, chords, song structure) and finding audio based on user queries (e.g. playlist generation or music recommendation). As such, the emphasis is on analysis of audio and audio-related metadata.

The tutorial was longer than the one at DAFx, with the addition of a live coding example from Chris Cannam and time for some discussion at the end. My data management presentation was again about 30 minutes.

The data management content was based on the DAFx Tutorial. However, some bits were trimmed back and some additions were made. In particular, some very short slides between topics were removed, as they were basically placeholders for me rather than providing content, and slides covering specific examples of research data loss were added.

Examples of people losing their PhD content include:

  • Laptop theft from car (with no backup) - 19 March 2008 Surrey Leader, Canada
  • Another laptop theft from car - 27 May 2010 The Press, NZ
  • Laptop stolen from car and backup out of date - 6 January 2011 NBC4i
  • Burglary in which laptop and hard drives were stolen - 15 February 2011 Manawatu Standard, NZ
  • Theft of computer and safe containing backup from a flat - 22 December 2010 KRQE
  • Overwriting data with a backup - 5 October 2005 Linux Forums

I found data on storage hardware failures in papers from the USENIX conferences on File and Storage Technologies (FAST). In 2007, Eduardo Pinheiro and Wolf-Dietrich Weber from Google presented a paper on Failure Trends In A Large Disk Drive Population, which revealed that from a population of some 100,000 disks used at Google, approximately 13% were replaced as part of repair procedures over a 3 year interval.
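To put that figure in per-year terms, here is a rough Python sketch. It assumes a constant yearly replacement rate and independent failures between copies, which is my simplification rather than a result from the paper, and the function names are purely illustrative.

```python
def annualised_rate(cumulative_fraction: float, years: float) -> float:
    """Convert a cumulative replacement fraction into an equivalent
    constant yearly rate (assumption: the rate does not change with age)."""
    survival = 1.0 - cumulative_fraction
    return 1.0 - survival ** (1.0 / years)


def loss_probability(yearly_rate: float, years: float, copies: int = 1) -> float:
    """Chance that every copy fails within the period.

    Assumptions: copies fail independently and failed copies are never
    replaced, so this is a pessimistic illustration, not a prediction.
    """
    single_copy = 1.0 - (1.0 - yearly_rate) ** years
    return single_copy ** copies


if __name__ == "__main__":
    rate = annualised_rate(0.13, 3)  # 13% replaced over 3 years
    print(f"equivalent yearly rate: {rate:.1%}")
    for copies in (1, 2, 3):
        p = loss_probability(rate, years=3, copies=copies)
        print(f"{copies} copy/copies over 3 years: {p:.2%} chance of losing all")
```

Under these assumptions, 13% over 3 years corresponds to roughly 4.5% per year, and keeping a second independent copy cuts the chance of losing everything over the same period from 13% to under 2%.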

The cloud is often seen as a suitable medium for data storage - hence the "why not use DropBox?" question at our DAFx tutorial. I therefore looked online for studies on whether the cloud is suitable for data management. The best resource I found for this was the JISC/DCC white paper on Curation In The Cloud.

I reviewed the terms and conditions (T&Cs) for various storage providers - particularly Apple iCloud, Microsoft SkyDrive, Google and DropBox. Both Microsoft and Google appear to require the right to do whatever they want with your data to "improve" their services. Google also retain those rights after you stop using their service. This is, of course, different from them actually doing any of those things - but be aware that under a less benevolent regime corporations may use their rights over your data. DropBox and Apple have better agreements: Apple restrict their right to publish content to content that you want published (whereas Google and Microsoft get the right to publish if it will "improve" their service); DropBox will only do things if they're necessary to provide the service you have requested.

In order to protect your data when it is stored in the cloud, we recommend encrypting your data. Possible solutions for this include using encrypted .dmg files on Mac OS, Truecrypt, BoxCryptor or encrypted cloud storage solutions such as SpiderOak.

Other concerns over storing data in the cloud include:

  • cost - although cloud storage is often free for small amounts of storage (say, a few GB), as data increases you will either need to spread data across providers (and end up with multiple T&Cs and more effort finding your data) or pay (and maybe change to cheaper providers when possible)
  • bandwidth restrictions - how long will it take you to download your data? Are there charges for "excessive use"?
  • access - cloud storage is usually not easily publicly accessible on the web
  • control over how data is stored - are the providers really looking after your data properly?
  • control over physical location of data - the laws relating to data in the cloud may be based on where the data is stored rather than the location of the owner of the data. If you don't know where your data is, how can you know the legal status of your data?
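The bandwidth question can be estimated with simple arithmetic. The sketch below is a back-of-envelope calculation in Python; the 500 GB dataset, the 10 Mbit/s link and the 80% efficiency factor are made-up illustrative values, not measurements from any provider.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb (decimal gigabytes) over a link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually achieved once protocol overhead and contention are included.
    """
    total_bits = size_gb * 8e9
    effective_bits_per_s = link_mbit_per_s * 1e6 * efficiency
    return total_bits / effective_bits_per_s / 3600.0


if __name__ == "__main__":
    hours = transfer_hours(size_gb=500, link_mbit_per_s=10)
    print(f"about {hours:.0f} hours ({hours / 24:.1f} days)")
```

With these example numbers the download takes roughly 139 hours, or a little under 6 days of continuous transfer, before any "excessive use" limits are considered.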

Several of these increase the risk of getting locked in to a provider (or the cost of changing provider). It is easy to upload data bit-by-bit to an online archive, but it is more difficult to change provider once you have a lot of data to transfer and no physical access to the data storage. This may leave your data at the mercy of changing costs.
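One way to reduce the risk when moving a large collection between providers is to record checksums before the transfer and compare them afterwards. This is a minimal sketch, assuming the data is available as ordinary files in a local folder; the function names are hypothetical and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """List files that are missing, new, or whose checksum differs."""
    names = before.keys() | after.keys()
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Building a manifest before uploading, and again from a fresh download after changing provider, gives a simple check that nothing was lost or altered along the way.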

The "Preserve" part of our research data management message was reinforced with this information, but finding evidence to convince researchers to document, organize and publish their data has (so far) proven more difficult.

In the ISMIR presentation, we also showed our solution to publishing research data - the research group data repository created during the SMDMRD project. This provided a framework to show how metadata can be added to datasets, and that the use of a research data repository allows datasets to be found through Google Scholar, correctly attributed to an author and showing the date, title and abstract for the dataset.

During ISMIR, several presentations covered the research process, both in terms of data and workflow. These included: a proposal for important metadata for MIR datasets [1]; a new dataset [2]; an examination of challenges in melody extraction (which raised issues on the meaning of annotations, the duration of samples within datasets and the size of available datasets) [3]; and supporting a sustainable research process using workflow systems [4]. The MIRrors special session at ISMIR looked back over work in MIR and considered how that can inform future development of MIR and included papers on the problems of interdisciplinary communication [5] and on reusable research [6]. The latter of these promoted considering systems in terms of Research Objects, bundling together all the relevant information (e.g. data, methods, participants) for experiments and investigations as a complete package. Papers from this session and additional papers will form a special issue of the Journal of Intelligent Information Systems expected to be published in Autumn 2013.

On the final Friday afternoon of the conference, a session was held for late-breaking papers and demos. This session was structured as an unconference at which the format and content were determined by the participants. Possible themes were submitted both before and during the conference. Several sessions were relevant to our project, showing an interest within the community in the topic.

I attended sessions on:
* Interfaces and Infrastructures
* (Meta)Evaluation
* MIR Challenges
* Teaching MIR

The full set of PDF handouts from the session is available on the SoundSoftware Project site and the data management presentation is available in both PDF and OpenOffice formats at the SoDaMaT project on


The full ISMIR proceedings for all 13 conferences are available online.

