Connecting DataFlow-DataStage and DSpace

One of the goals we set for our project's two-month extension was to explore the use of DataFlow's DataStage in combination with DSpace (see this post). Now that a beta version of DataStage has been released with SWORD2 support implemented, we gave it a try and, with some tweaks, managed to publish a dataset from DataStage to DSpace. This is a short account of the setup and the tweaks involved.

Setup

  • DSpace 1.8.2
  • DataStage 0.3rc2 running locally in VirtualBox (MacOSX host)

Tweaks

    • VirtualBox. This only matters because the DataStage virtual machine is running locally; on a "real" server it probably won't apply, since you would access the machine over a proper network. To reach the Guest machine from the Host, I added a second network adapter of type "Host-only" to the virtual machine, because the default adapter (NAT type) only connects the Guest to the internet. The firewall on the Guest (DataStage) also needs to be tweaked manually (I couldn't get this working by running the automatic datastage-config). For simplicity, I just turned the firewall off completely (I'm only working locally for now). Otherwise the firewall blocks SMB and SSH, although the web interface works in either case. Note that I haven't tried to access DataStage from another computer, only from the Host.
    • DataStage. DataStage doesn't seem to load correctly when my virtual machine boots, so the background thread that handles the SWORD2 deposits never starts and no files are sent. The fix is simply to restart DataStage manually:

      datastage-server stop
      datastage-server start

      Took me two days to figure this one out!

    • DataStage + SWORDv2 + DSpace. DataStage submits a dataset in two stages: it first creates the entry with basic metadata (an Atom entry), and then completes it by adding the data files (an entry update). By default, the SWORD client sends "in_progress=False" (the SWORDv2 In-Progress header) when it creates the Atom entry in a repository. With our DSpace setup, this causes the entry to be archived immediately, and because the entry is archived, the server rejects the second operation (the update with files). I therefore changed the value sent by DataStage to "in_progress=True" (in the file /dataset/sword2depositor.py). Richard Jones, one of the SWORD developers who also worked on the SWORD-related issues of DataStage, pointed out to me that this might not be necessary if DSpace had a different workflow in place, one that doesn't immediately archive an entry. We at C4DM chose this approach because we want users to go to DSpace to review and complete their submission with all the required metadata (not included in the SWORD package) and to choose a license (not implemented right now in SWORD). This is the same workflow we use with our sworduploader utility (see here). The DataStage tweak works for now; we haven't tried the alternative DSpace tweak.
    • DSpace ingester. The DSpace SWORDv2 server configuration must be changed to accept the http://dataflow.ox.ac.uk/package/DataBankBagIt packages sent by DataStage, and an ingester must be assigned to them. Since a DataBankBagIt package is currently just a zip archive plus a manifest.rdf file containing some basic metadata, a dedicated ingester is not strictly required: the SimpleZipContentIngester can be used. Nevertheless, I created a new ingester called SwordDataBankBagItIngester: I repackaged the SimpleZipContentIngester and modified it to handle manifest.rdf. For now it ignores the manifest completely (the file isn't even added to the bitstreams). In the future, I might parse the manifest to extract metadata and update the entry.

      Only a couple of lines must be added to the swordv2-server.cfg in DSpace:

      accept-packaging.item.DataBankBagIt = http://dataflow.ox.ac.uk/package/DataBankBagIt

      and

      plugin.named.org.dspace.sword2.SwordContentIngester = \
      ...
      org.dspace.sword2.SwordDataBankBagItIngester = http://dataflow.ox.ac.uk/package/DataBankBagIt, \ ...

      or, if we want to use the SimpleZip ingester:

      plugin.named.org.dspace.sword2.SwordContentIngester = \
      ...
      org.dspace.sword2.SimpleZipContentIngester = http://dataflow.ox.ac.uk/package/DataBankBagIt, \ ...
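The two-stage deposit described under "DataStage + SWORDv2 + DSpace" can be sketched in Python as the body and headers a client would prepare for each request. This is only an illustration, not the code in sword2depositor.py: the function names are hypothetical, the second stage is assumed to be a zip upload that carries a Packaging header, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Packaging URI used by DataStage, as quoted in the DSpace configuration.
DATABANK_BAGIT = "http://dataflow.ox.ac.uk/package/DataBankBagIt"


def build_atom_entry(title, author):
    """Stage 1 body: a minimal Atom entry carrying basic metadata."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    return ET.tostring(entry, encoding="unicode")


def create_entry_headers(in_progress=True):
    """Stage 1 headers: In-Progress "true" asks the server to keep the item open."""
    return {
        "Content-Type": "application/atom+xml;type=entry",
        "In-Progress": "true" if in_progress else "false",
    }


def add_files_headers(filename, packaging=DATABANK_BAGIT):
    """Stage 2 headers (assumed shape): the zipped data files and their packaging."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": packaging,
    }
```

With in_progress=False in the first stage, a DSpace set up like ours archives the item straight away, so the request built from add_files_headers would be rejected; with in_progress=True the item stays open for the files and for later review in DSpace.
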
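What the modified ingester described under "DSpace ingester" does with a package can be illustrated with a short Python sketch. The real SwordDataBankBagItIngester is a Java class inside DSpace; the function below is a hypothetical stand-in that only shows the idea of unpacking the zip while leaving manifest.rdf out of the resulting files. Skipping a manifest at any folder depth is an assumption made for the example.

```python
import io
import zipfile

MANIFEST_NAME = "manifest.rdf"


def extract_payload(package_bytes, skip=MANIFEST_NAME):
    """Return {member name: content} for every file in the zip except the manifest."""
    files = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for info in package.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            # Assumption: compare on the base name, so a manifest inside a
            # subfolder is skipped as well.
            if info.filename.rsplit("/", 1)[-1] == skip:
                continue
            files[info.filename] = package.read(info)
    return files
```

Parsing the manifest for metadata, as considered for the future, would replace the second `continue` with a call that reads the RDF instead of discarding it.
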

DataStage is still in beta and the interface needs some polishing, but the core functionality is there, and the fact that SWORD2 works almost out of the box is very promising!