Easily create DSpace submissions with many files

One of the limitations of DSpace pointed out by our test users is the way the web interface manages the upload of bitstreams (i.e. files) during the submission process.

The standard DSpace submission process is divided into a number of steps. First, the submitter creates a submission by filling in some basic metadata. Then, a file upload page is used to add files to the dataset. This page has two main limitations: firstly, we can only upload one file at a time; secondly, there is no progress bar to indicate the status of the upload. Obviously, if a dataset contains more than a few files, the submission process becomes very cumbersome. We tried, as a sub-optimal alternative, to upload a ZIP archive containing all the dataset files. However, this solution prevents users from seeing which files are in the dataset and downloading them one by one. Furthermore, the ZIP file can be very large, and a few times the upload process failed, leaving the submitter frustrated.

We were aware of these limitations from the beginning, and identified a number of alternative solutions to make the upload process more streamlined and user-friendly.

DSpace packager plugins: DSpace supports importing items in the form of packages, usually zip files containing a manifest file (XML with descriptive metadata and the list of files in the item) and the dataset files. Data files do not necessarily have to be in the package, if the manifest file contains references to where they are located. The plugins support different package formats. Unfortunately, two factors limit the use of these plugins for our purposes:

  • Creating the manifest file can be difficult, and requires good knowledge of XML and of the standard package formats (e.g. METS). Creating the XML file can potentially be as cumbersome as uploading one file at a time through the web interface.
  • Packager plugins are administrative command line tools. Running them requires SSH access to the server hosting DSpace, as well as administrative privileges. Our goal is to create a "self-service" system, with no dedicated data manager/curator who would import datasets in this way, and giving administrative rights to all users is clearly not acceptable.

An alternative approach is SWORD:

"SWORD is a lightweight protocol for depositing content from one location to another. It stands for Simple Web-service Offering Repository Deposit and is a profile of the Atom Publishing Protocol (known as APP or ATOMPUB). SWORD has been funded by the Joint Information Systems Committee (JISC) to develop the SWORD profile and a number of demonstration implementations." (from http://swordapp.org/about/).

The latest version of DSpace (1.8) officially supports ingestion using both SWORDv1 and SWORDv2 (two different plugins). The SWORDv2 protocol is an extension of SWORDv1 that allows clients not only to create an item, but also to modify and add to existing ones. We identified four solutions that use the SWORD protocol, and have been testing them with our repository.

A multi-platform desktop client appears to be a simple, user-friendly solution. For example, our colleague Chris Cannam developed EasyMercurial, a desktop client that simplifies the use of a Mercurial repository, for the sustainable software development project SoundSoftware.

A Java-based SWORDv1 desktop client is available at http://sourceforge.net/projects/sword-app/files/. We successfully ingested the test package supplied with the client using the SWORDv1 DSpace plugin, but failed with the SWORDv2 one, although it should be backward compatible. We then tried to upload a single file (not packaged), with no luck. Using packages brings us back to the difficulties of the packager plugins, in particular creating the manifest file, although without the need for administrator tools.

We then considered the possibility of developing a simple client ourselves. Ideally, it should provide an easy way to create a manifest file by filling in the required metadata, and to add bitstream files with simple drag and drop. The client should then create a package and send it to the repository for ingestion. SWORDv2 client toolkits for several programming languages (Java, Ruby, Python, PHP) are also available from the swordapp.org website.
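As a rough illustration of the packaging step such a client would perform, here is a minimal Python sketch. It is not an existing tool, and the manifest layout (a "manifest.xml" file with "metadata" and "files" elements) is a hypothetical simplification rather than valid METS; the function name build_package is also made up for this example. A real client would have to emit whatever package format the repository's ingester expects.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def build_package(metadata, files, package_path):
    """Write a zip containing a manifest plus the given files.

    NOTE: the manifest layout is a made-up simplification, not valid METS.
    """
    root = ET.Element("manifest")

    # Descriptive metadata: one <field name="..."> element per entry.
    meta_el = ET.SubElement(root, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta_el, "field", name=name).text = value

    # File list: name and size in bytes of every bitstream in the package.
    files_el = ET.SubElement(root, "files")
    for f in files:
        f = Path(f)
        ET.SubElement(files_el, "file", name=f.name, size=str(f.stat().st_size))

    manifest_bytes = ET.tostring(root, encoding="utf-8")

    # The manifest goes in first, followed by the files themselves,
    # stored flat (directory parts of the source paths are dropped).
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.xml", manifest_bytes)
        for f in files:
            zf.write(f, arcname=Path(f).name)
    return package_path
```

A call such as build_package({"title": "My dataset"}, ["take1.wav", "notes.txt"], "package.zip") would produce an archive ready to be sent to the repository; the drag-and-drop front end and the transfer itself are left out.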

We tested the Python modules, but again with mixed results. Using SWORDv2, we managed to retrieve the Service Document, which describes the repository, and to create a metadata-only entry. Again, we failed when trying to upload bitstreams. The main difficulty we encountered is understanding where the problem lies: in the client implementation, in the DSpace server implementation, or in our DSpace configuration. We are planning to investigate this further.
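To make the metadata-only step more concrete, the following Python sketch builds the kind of request involved, using only the standard library instead of the SWORD client modules we tested. In SWORDv2, a metadata-only deposit is an Atom entry POSTed to a collection URI. The collection URL in the usage note is a placeholder, the choice of Dublin Core terms is an assumption, and a real server may require further Atom elements (id, updated, author) as well as authentication. The request is only constructed here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_metadata_entry(title, creator, abstract):
    """Serialise a minimal Atom entry carrying Dublin Core terms."""
    ET.register_namespace("atom", ATOM_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = creator
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}abstract").text = abstract
    return ET.tostring(entry, encoding="utf-8")


def build_deposit_request(collection_url, entry_xml, in_progress=True):
    """Prepare (but do not send) the POST that creates the entry.

    In-Progress: true tells the server that more content will follow,
    so the item should not be treated as complete yet.
    """
    headers = {
        "Content-Type": "application/atom+xml;type=entry",
        "In-Progress": "true" if in_progress else "false",
    }
    return urllib.request.Request(
        collection_url, data=entry_xml, headers=headers, method="POST"
    )
```

For example, build_deposit_request("http://repository.example.org/swordv2/collection/123456789/2", build_metadata_entry("Test dataset", "A. Author", "Short description")) returns a request object that could be passed to urllib.request.urlopen once credentials have been added. Inspecting the request before sending it is one way to narrow down whether a failure originates in the client or on the server.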

Another interesting approach we were keen to try was the WatchFolder developed by the DepositMO project. This is a Perl script that monitors a folder on the user's computer. When new files are added to the folder, the script either initiates a new submission or updates an existing one. The script was originally designed to work with EPrints, but it should work with DSpace as well. We tried the script and after some help from the developers, we managed to at least connect to DSpace and create an entry, but again it wasn't possible to upload bitstreams. As mentioned in the comments, the developers are going to test the script more extensively with DSpace, and hopefully we will be able to use it for our purposes.
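The folder-monitoring idea can be sketched in a few lines. The following is not the DepositMO script (which is written in Perl) but a simplified Python illustration of the polling logic, with hypothetical function names: take a snapshot of the folder, compare it with the previous one, and decide which files would start or update a submission. The actual deposit call is omitted.

```python
import os


def snapshot(folder):
    """Map each regular file in the folder to its modification time."""
    return {
        entry.name: entry.stat().st_mtime
        for entry in os.scandir(folder)
        if entry.is_file()
    }


def diff_snapshots(old, new):
    """Return (added, changed) file names between two snapshots."""
    added = sorted(name for name in new if name not in old)
    changed = sorted(
        name for name in new if name in old and new[name] != old[name]
    )
    return added, changed
```

A watcher would call snapshot at regular intervals, pass the previous and current results to diff_snapshots, and then deposit the added files and re-send the changed ones. Subfolders and deleted files are ignored in this sketch.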

Along the lines of DepositMO's WatchFolder is our previously mentioned idea of using DataStage from the DataFlow project to store work-in-progress data, and to deposit it when ready in DSpace instead of DataBank, again using SWORD. When DataFlow released the beta version of DataStage, we managed to install it successfully from the Debian packages. However, at the moment SWORDv2 support is still under development.

Finally, another idea we have not yet tested is to use the web interface to upload a zip file, and then run a script on the server to unpack it and store each file separately. This would require us to first solve the progress bar issue, and the problem with uploads of large files.
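The unpacking half of such a server-side script could look like the following Python sketch. It is untested against DSpace and only covers extraction: attaching each extracted file to the item as a separate bitstream is not shown. The function name unpack_archive is hypothetical. Since the archive comes from a user, the sketch refuses entries whose paths would end up outside the target directory.

```python
import shutil
import zipfile
from pathlib import Path


def unpack_archive(zip_path, target_dir):
    """Extract every file in the archive and return the extracted paths."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            destination = (target / info.filename).resolve()
            # Reject entries such as "../x" that would escape target_dir.
            if target not in destination.parents:
                raise ValueError(f"unsafe path in archive: {info.filename}")
            destination.parent.mkdir(parents=True, exist_ok=True)
            # Copy in chunks so large files are not read into memory at once.
            with zf.open(info) as src, open(destination, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(destination)
    return extracted
```

The returned list of paths is what a later step would iterate over to register the files one by one.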

To summarise, several options are potentially available to make depositing files easier and more streamlined, but at the moment none of them seems to work properly. It is difficult to understand whether the issues are related to our DSpace configuration or to problems in the various implementations. The fact that all the code is free, under continuous development, and used by very few people makes it even more difficult.