SWORD submission tools for DSpace

This is a follow-up post to our previous considerations about multi-file submissions to DSpace using the SWORD standard protocol. After getting in touch with the SWORD developers through the sword-app-tech mailing list, we managed to create a command-line tool that allows us to easily submit large numbers of files to DSpace. Furthermore, we added basic support for BagIt packages to the dspace-swordv2 server.

As I mentioned in the previous post, "several options are potentially available to make depositing files easy and more streamlined, but at the moment none of them seems to work properly." In particular, we put some work into creating a command-line client tool based on the swordv2 python module, but we couldn't figure out why it failed to ingest our files. Thanks to the help from the SWORD developers, we finally managed to build what we had in mind.

The script is called sworduploader.py, and the code is available in the project's code repository at code.soundsoftware.ac.uk. The script has also been packaged as a stand-alone application for both Windows and Mac OS X. We chose this approach because we wanted users to be able to just download the program and use it, without having to install Python and all the dependencies. (Note: we forked the swordv2 python module project on BitBucket because of some compatibility problems with DSpace's service document. Hopefully our changes will find their way into the main branch of the project.)

The script's basic usage is:

python sworduploader.py data

where data can be a single file, a directory, a simple zip file, a METSDSpaceSIP package, or a BagIt package. The script will recognise what type of submission is required, and send the correct headers to the server.
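
The detection step could look roughly like the sketch below. This is not the code from sworduploader.py: the function names and the detection rules (a mets.xml at the archive root, a bagit.txt at the root or one directory down) are assumptions made for illustration, and so is the BagIt packaging URI. The Binary, SimpleZip and METSDSpaceSIP identifiers are the packaging URIs commonly used with SWORD.

```python
import os
import zipfile

# Packaging identifiers that would go in the request's Packaging header.
PACKAGING = {
    "binary": "http://purl.org/net/sword/package/Binary",
    "simplezip": "http://purl.org/net/sword/package/SimpleZip",
    "mets": "http://purl.org/net/sword/package/METSDSpaceSIP",
    "bagit": "http://purl.org/net/sword/package/BagIt",  # assumed identifier
}


def classify_zip_members(names):
    """Guess the package type from the member names of a zip archive."""
    for name in names:
        # Assumption: a Bag has bagit.txt at the root or inside one top-level folder.
        if name == "bagit.txt" or (name.count("/") == 1 and name.endswith("/bagit.txt")):
            return "bagit"
    # Assumption: a METSDSpaceSIP package has mets.xml at the archive root.
    if "mets.xml" in names:
        return "mets"
    return "simplezip"


def detect_submission_type(path):
    """Return 'directory', 'binary', 'simplezip', 'mets' or 'bagit' for a path."""
    if os.path.isdir(path):
        return "directory"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return classify_zip_members(archive.namelist())
    return "binary"
```

A directory has no packaging of its own; it would either be sent file by file or zipped first, depending on the options described below.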

Optional arguments are:

--help: Show usage information and exit.

--username USER_NAME: DSpace username.

--servicedoc DSPACEURL: URL of the SWORDv2 service document. This overrides the server specified in server.cfg. If server.cfg is missing and this option is not given, the script defaults to C4DM's DSpace repository.

--zip: If "data" is a directory, it is first compressed and then sent as a single zip archive. The zip file itself is stored as a bitstream alongside the individual files.

It is also possible to specify some basic metadata. This is especially useful when we are not using packages containing metadata.

--title TITLE: Title (ignored for METS packages).

--author AUTHOR [AUTHOR ...]: Author(s) (ignored for METS packages). Accepts multiple entries in the format "Surname, Name".

--date DATE: Date of creation (string) (ignored for METS packages).
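
For readers who want to build something similar, the options listed above map naturally onto Python's argparse module. The sketch below is a reconstruction from the option list in this post, not a copy of the actual script; the function name build_parser and the help texts are ours.

```python
import argparse


def build_parser():
    """Build a command-line parser mirroring the options described in this post."""
    parser = argparse.ArgumentParser(
        prog="sworduploader.py",
        description="Submit files or packages to DSpace over SWORDv2.",
    )
    parser.add_argument("data", help="File, directory, zip file or package to submit.")
    parser.add_argument("--username", metavar="USER_NAME", help="DSpace username.")
    parser.add_argument("--servicedoc", metavar="DSPACEURL",
                        help="URL of the SWORDv2 service document.")
    parser.add_argument("--zip", action="store_true",
                        help="Compress a directory and send it as one zip archive.")
    parser.add_argument("--title", help="Title (ignored for METS packages).")
    # nargs="+" is what produces the "AUTHOR [AUTHOR ...]" usage string.
    parser.add_argument("--author", nargs="+", metavar="AUTHOR",
                        help='Author(s) in the format "Surname, Name".')
    parser.add_argument("--date", help="Date of creation, passed through as a string.")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Each author has to be quoted on the command line so that the comma and space stay in one argument, for example: --author "Smith, Jane" "Doe, John".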

The script connects to the server, and after successful authentication, it downloads the service document and prints a list of the available Collections. The user is prompted to choose the Collection in which the submission should be made.
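
To illustrate the Collection listing, here is a minimal sketch that uses only the Python standard library. The real script relies on the swordv2 python module for this; the version below just assumes the service document is a standard AtomPub service document that has already been downloaded as a string. The function names and the read_choice parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by AtomPub service documents.
NS = {
    "app": "http://www.w3.org/2007/app",
    "atom": "http://www.w3.org/2005/Atom",
}


def list_collections(service_document_xml):
    """Return (title, href) pairs for every collection in a service document."""
    root = ET.fromstring(service_document_xml)
    collections = []
    for collection in root.findall("app:workspace/app:collection", NS):
        title = collection.findtext("atom:title", default="(untitled)", namespaces=NS)
        collections.append((title, collection.get("href")))
    return collections


def choose_collection(collections, read_choice=input):
    """Print a numbered list of collections and return the one the user picks."""
    for number, (title, href) in enumerate(collections, start=1):
        print(f"{number}. {title} ({href})")
    choice = int(read_choice("Select a Collection: "))
    if not 1 <= choice <= len(collections):
        raise ValueError(f"No Collection with number {choice}")
    return collections[choice - 1]
```

The href of the chosen Collection is the address the deposit request is then sent to.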

If the dataset is submitted successfully, the submission remains "open" and must be completed with the necessary metadata and licenses through the DSpace web interface ("My Account -> Submissions", and then "Resume"). This way, the user has to go through the online metadata submission form and the License choice form, which ensures that all the required steps are completed. METSDSpaceSIP packages are a special case: the SWORD server ingester only allows these packages to be submitted to Collections with a Workflow.

Note that to be able to ingest BagIt packages, we developed a basic SwordBagItContentIngester module. The ingester opens the Bag, verifies the checksum of the files before creating a bitstream in the item, and then adds optional metadata. We have been using the optional "bag-info.txt" file to pass metadata to the ingester in the form: "namespace-term-qualifier: value". If the metadata field is not registered in DSpace, it is ignored. We successfully tested the ingester with BagIt packages created with the Bagger application.
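
The ingester itself is a Java module running inside DSpace, but the two steps it performs on a Bag are easy to illustrate in Python. The sketch below is not a port of the ingester. It assumes that the qualifier part of a metadata label is optional, that manifest lines have the usual "checksum path" layout of BagIt manifests, and that MD5 is the checksum algorithm; all function names are ours.

```python
import hashlib


def parse_bag_info_line(line):
    """Split 'namespace-term-qualifier: value' into (schema, element, qualifier, value).

    Returns None when the line has no ':' separator or no namespace prefix.
    """
    label, separator, value = line.partition(":")
    if not separator:
        return None
    parts = label.strip().split("-", 2)
    if len(parts) < 2:
        return None
    schema, element = parts[0], parts[1]
    # Assumption: the qualifier may be left out, as in "dc-title".
    qualifier = parts[2] if len(parts) == 3 else None
    return schema, element, qualifier, value.strip()


def parse_manifest_line(line):
    """Split a manifest line ('<checksum> <path>') into (checksum, path)."""
    checksum, path = line.strip().split(None, 1)
    return checksum.lower(), path


def checksum_matches(payload, expected, algorithm="md5"):
    """Check the bytes of one payload file against the checksum from the manifest."""
    return hashlib.new(algorithm, payload).hexdigest() == expected.lower()
```

Standard bag-info.txt labels such as "Bagging-Date" also split into a schema and an element under this rule, but since no such field is registered in DSpace they would simply be ignored, as described above.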

To manage BagIt packages, we tried to use the Bag class available in the optional DSpace Replication Task Suite, as suggested in some DSpace forums. This class only implements parts of the BagIt specification, so we extended it with a few extra methods. Unfortunately, we couldn't achieve this by simply subclassing the Bag class, so we created a new class, SwordBag, which incorporates the code from the Replication suite. We will publish this code on our code repository once it is ready and, if it is considered useful, we hope it will be included in future DSpace releases. In the process of developing our script, we also discovered a couple of small issues with the SimpleZip and BinaryData content ingesters, which have been reported to the developers.

The work on the BagIt ingester also had the longer-term goal of allowing us to connect DataStage to DSpace, so we are looking forward to the next DataFlow release, which should happen on April 24.