DSpace access control: public and private data

One of the required features of the data management system for C4DM is the ability to hide certain datasets from public access (for copyright or other reasons) while leaving them accessible to C4DM staff. This was shown in the schematic overview in Figure 1 in the post about the online user survey. In this post, I will discuss a few possible solutions.

A lot of data used for research at C4DM consists of copyrighted material. These data include several hundred audio files ripped from CDs or legally purchased online, as well as MIDI files and musical scores. We also use datasets we obtained from other research groups under special (usually verbal) non-disclosure agreements. At the moment, these data are stored on a private server only accessible from within QMUL (see Deliverable D1.2).

Managing these data is a bit tricky, though. We can in fact divide these "private" data into two categories:

  • Work-in-progress data. A lot of data is stored on the server and organised in folders. Most of the time, the only description of the data is a very basic README.txt file. These data are difficult to categorise, do not fit well into a repository structure, and would require a lot of work to make them ready for publication (even on a private repository).
  • Complete datasets that are simply not published for copyright reasons. An example is a set of commercial audio files used in an experiment published in a paper. In this case, the best solution would be to publish the dataset on the public repository: the data connected to the experiment's results, and the metadata describing the music and where to find and purchase it, would be publicly accessible, while the actual audio files would be available only to registered users at C4DM. This category also includes the datasets obtained from other groups.

The DataFlow project addresses exactly this problem by providing two separate solutions: the DataStage shared virtual disk space, and the DataBank repository. As discussed in a previous post, DataFlow would have been an optimal solution for our needs, but unfortunately its "under development" status prevented us from adopting it.

Our current setup is somewhat similar to DataFlow: we have our old private file server, and we have a repository (DSpace). There are of course differences between our system and DataFlow. First, DataStage and DataBank are closely integrated to make it easy to publish data directly from DataStage to the repository. Second, the user of DataStage is given three data sharing options (private, shared, and collaborative).

As mentioned before, we are thinking of testing DataStage in combination with DSpace, using SWORD to publish datasets (once SWORD support is implemented in DataStage). In any case, managing the "work in progress" data is important, but less crucial than finding a good way to manage complete datasets containing restricted data.

When we adopted DSpace, we planned to use access control (users and groups) to hide certain data while still showing its metadata. In theory, this should work in DSpace by setting authorisation policies for Communities, Collections, and Items. In practice, however, the way this is implemented does not work as we would expect. Authorisation policies are not inherited by a Collection or a Sub-community in a Community. Items in a Collection do inherit the Collection's authorisation policies, but this is not retroactive: changing the policies of a Collection does not affect the Items already in it. This causes two problems. First, when creating new Collections, one has to select the appropriate policies manually. Second, and more importantly, it is not always obvious what the policies for a certain object are, as it would be in a normal file system.

As an example, let's assume we have a community with restricted access: if we try to access its description, we are prompted to log in. Within that community we create a collection and forget to restrict access to it as well (since the restriction is not inherited). We then create an item in this collection, which will also be public. If we look at the list of Communities and Collections and click on the (mistakenly) public collection, we can access it without logging in, and the same applies to the item.
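The scenario above can be reproduced with a small toy model. This is only a sketch of the behaviour described in this post, not DSpace code: the `Node` class, the helper functions, and the group names "Anonymous" and "C4DM staff" are all hypothetical, and the model assumes that a new collection is public unless restricted by hand and that an item copies its collection's policy once, at creation time.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

ANONYMOUS = "Anonymous"  # hypothetical group name for users who are not logged in


@dataclass
class Node:
    """A community, collection, or item in a toy repository tree."""
    name: str
    read_groups: Set[str]
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def create_community(name, read_groups):
    return Node(name, set(read_groups))


def create_collection(parent, name, read_groups=None):
    # Assumption: nothing is inherited from the parent community, so the
    # collection is public unless the creator restricts it explicitly.
    groups = {ANONYMOUS} if read_groups is None else set(read_groups)
    node = Node(name, groups, parent)
    parent.children.append(node)
    return node


def create_item(collection, name):
    # Assumption: the item copies the collection's policy at creation time,
    # so later changes to the collection do not reach it.
    node = Node(name, set(collection.read_groups), collection)
    collection.children.append(node)
    return node


def can_read(node, group):
    # Only the object's own policy is checked; ancestors are ignored.
    return group in node.read_groups


def find_leaks(node, restricted_above=False):
    """Return names of anonymously readable objects below a restricted one."""
    leaks = []
    public = can_read(node, ANONYMOUS)
    if public and restricted_above:
        leaks.append(node.name)
    for child in node.children:
        leaks.extend(find_leaks(child, restricted_above or not public))
    return leaks


community = create_community("Restricted audio", {"C4DM staff"})
collection = create_collection(community, "Experiment A")  # restriction forgotten
item = create_item(collection, "track01.wav")

print(can_read(community, ANONYMOUS))   # False
print(can_read(collection, ANONYMOUS))  # True
print(can_read(item, ANONYMOUS))        # True
print(find_leaks(community))            # ['Experiment A', 'track01.wav']
```

A check along the lines of `find_leaks` is the kind of audit one would have to run by hand today, since nothing in the tree itself signals that a public collection sits inside a restricted community.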

The question is then: what's the point of hiding the information about the community if everything in it is still accessible? We would like to have a system where we create a private community, and everything in that community is accessible only to authorised users. Or, at least, one where only the metadata is accessible to anonymous users, but the data is not. As a sub-optimal solution, we are planning to implement two separate instances of DSpace, one completely public and one completely private (i.e. only accessible from within the QMUL network). As a related issue, see Steve's post on authentication.
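For comparison, the file-system-like behaviour we would like can be sketched in a few lines. Again, this is a hypothetical model rather than anything DSpace provides: the `Obj` class, the `effective_can_read` function, and the group names are made up for illustration. The idea is that an object is readable only if the user's group is allowed on the object itself and on every ancestor above it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Obj:
    """A repository object with its own read policy and an optional parent."""
    name: str
    read_groups: FrozenSet[str]
    parent: Optional["Obj"] = None


def effective_can_read(obj, group):
    # Walk up to the root: a restriction anywhere on the path denies access.
    node = obj
    while node is not None:
        if group not in node.read_groups:
            return False
        node = node.parent
    return True


staff_only = frozenset({"C4DM staff"})
everyone = frozenset({"Anonymous", "C4DM staff"})

community = Obj("Restricted audio", staff_only)
collection = Obj("Experiment A", everyone, community)  # left public by mistake
item = Obj("track01.wav", everyone, collection)

print(effective_can_read(item, "Anonymous"))   # False
print(effective_can_read(item, "C4DM staff"))  # True
```

With this rule, forgetting to restrict a collection is harmless, because the restriction on the community still applies to everything beneath it.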