Installing DSpace from a command-line on CentOS 6

Having evaluated possible data management solutions and selected DSpace as our solution, we need to allow some users to examine DSpace and consider how it will fit in with their workflow. The initial evaluation took place on a desktop PC with the data management server running as a virtual machine. In order to allow further evaluation we have now set up a "proper" server environment to test DSpace – this server has a command-line interface and no GUI, requiring a more "managed" approach to setting up the system.

Warning: this is a technical post, which is offered in the hope that it may be of assistance to anyone else installing DSpace.

DSpace is a Java application and requires:

  • a suitable Java servlet engine (Tomcat in our case, other options apparently include Jetty or Caucho Resin);
  • a relational database to store metadata and application data (we're using PostgreSQL, Oracle is an option);
  • a JDK (Oracle Java SE 6.0);
  • Maven to build DSpace according to the local configuration and to download dependencies – this is required even if the DSpace binary release is used;
  • ant to install the local build of DSpace;
  • disk space to store whatever data it is being used to manage.
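As a quick sanity check before starting, a sketch along the following lines could report which of the command-line tools are missing. The function name is hypothetical, and the command names in the usage comment (java, mvn, ant, psql, hg) are assumptions about how each tool appears on the PATH.

```shell
# Report which of the given commands are not on the PATH.
# Prints one missing command per line; prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example (command names are assumptions, adjust to the local setup):
#   missing_tools java mvn ant psql hg
```

Anything printed by the example call would then need installing before the DSpace build is attempted.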

The Server

Our CentOS 6 server was pre-installed with PostgreSQL and Tomcat. Four CPU cores and 4 GB of RAM are available to the server. The server was provided with a minimal amount of virtual disk space, but with the option to extend this as required – producing a minimal/slim server to meet our needs. CentOS uses yum to manage installation of standard software packages.

yum install package

Configuration Management

As the server doesn't include a GUI, it is necessary to download some files from the web and then to pass them through to the server in a controlled manner. To this end, we have used Mercurial repositories on the code.soundsoftware.ac.uk site to store the components required to install DSpace. Installing Mercurial on the server (via yum) then allows us to clone the repositories and "pull" software updates to the server. This approach was used for: DSpace itself; the Oracle JDK; and Maven (see below).

In order to copy the repository we need to:

yum install mercurial
hg clone http://path.to.mercurial.repository/project/subproject

In order to update the local copy of the project with any changes to the repository, we need to:

hg pull
hg update
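The clone and update steps can be combined into one helper, so the same command works on the first run and on later refreshes. This is only a sketch: the function name and the HG override variable are our own inventions, and the URL in the usage comment is the same placeholder as above.

```shell
# Clone a Mercurial repository on first use, otherwise pull and update it.
# HG can be overridden (e.g. HG=echo) to preview the commands without running them.
sync_repo() {
    url=$1
    dir=$2
    if [ -d "$dir/.hg" ]; then
        # -R tells hg which local repository to operate on
        "${HG:-hg}" -R "$dir" pull &&
        "${HG:-hg}" -R "$dir" update
    else
        "${HG:-hg}" clone "$url" "$dir"
    fi
}

# Example:
#   sync_repo http://path.to.mercurial.repository/project/subproject subproject
```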

Installing Maven

The Maven package is not in the default yum repositories for CentOS 6. We therefore needed to point yum at the JPackage repositories. NB: As of January 2012, the version 6.0 repositories were a work-in-progress – the JPackage "current" release is for "version 5" Linuxes. Adding the JPackage repositories allows yum to find maven2 and its dependencies.

Maven, and the related dependencies, were then installed using:

yum install maven2

This then required 232 MB of space in /usr to update CentOS, which wasn't available, but a quick email to the systems people solved that.

Installing Maven also installs ant.

Sadly the version of Maven that this resulted in was Maven 2.0.8 and when we came to attempt to install DSpace, one of the dependencies required Maven 2.0.9. However, having downloaded Maven 2.2.1 it was actually very straightforward to configure it based on the Maven version 3 installation instructions (at the bottom of the download page, after the download links). Alternatively, editing /usr/bin/mvn to contain the correct setting for M2_HOME allows Maven 2.2 to be used. Note that having installed Maven 2.0.8, relevant dependencies had already been installed making the upgrade painless.
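For a single shell session, the M2_HOME approach might look like the sketch below. The function name is hypothetical and the directory in the usage comment is an assumption; use wherever the Maven 2.2.1 archive was actually unpacked.

```shell
# Point the current shell at an unpacked Maven distribution by setting
# M2_HOME and putting its bin directory at the front of the PATH.
use_maven() {
    M2_HOME=$1
    PATH="$M2_HOME/bin:$PATH"
    export M2_HOME PATH
}

# Example, assuming Maven 2.2.1 was unpacked under /usr/local:
#   use_maven /usr/local/apache-maven-2.2.1
#   mvn -version    # check which Maven version is now picked up
```

Because the new bin directory is prepended, the mvn found first on the PATH is the newer one, while the yum-installed 2.0.8 stays in place.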

The Maven 2.2.1 distribution was downloaded on a desktop machine and uploaded to a Mercurial repository.

Configuring PostgreSQL

Following the DSpace installation instructions, PostgreSQL was configured to allow local TCP/IP connections – by editing /etc/postgresql/8.4/main/postgresql.conf – and some security was added by editing pg_hba.conf. However, after doing this I was unable to create the dspace database using the suggested command line, as the increased security prevented the given command from working. In order to create the database, the parameter -h localhost needed to be specified. This option causes PostgreSQL to use TCP/IP (and therefore the "host" configuration) rather than Unix-domain sockets (and the "local" configuration).

The full command-line is then:

createdb -U dspace -W -h localhost -E UNICODE dspace
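Since -h localhost only works if a matching "host" rule exists, it can be worth confirming that the rule is actually present in pg_hba.conf. The sketch below does this with awk; the function name is hypothetical and the file path in the usage comment is left for the reader to fill in.

```shell
# Succeed if a pg_hba.conf-style file has a "host" rule for the given
# database and user. Comment lines start with "#", so they never match.
has_host_rule() {
    awk -v db="$2" -v user="$3" '
        $1 == "host" && $2 == db && $3 == user { found = 1 }
        END { exit !found }
    ' "$1"
}

# Example (path is a placeholder):
#   has_host_rule /path/to/pg_hba.conf dspace dspace || echo "no host rule for dspace"
```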

Configuring Tomcat6

The Tomcat XML configuration file (/etc/tomcat6/server.xml) was edited as specified in the DSpace installation instructions – e.g. to remove upload timeouts – and Tomcat was restarted to check that the configuration was valid.

service tomcat6 restart

In case of errors (e.g. typos in the configuration), the log files in /var/log/tomcat6 (and specifically catalina.out) contain the error messages.
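To avoid paging through the whole log after each restart, a small filter such as the following sketch can pull out the most recent suspicious lines. The function name is hypothetical, and the patterns (SEVERE and Exception) are an assumption about what a configuration error looks like in catalina.out.

```shell
# Print the most recent lines of a log file that look like errors,
# prefixed with their line numbers. The second argument limits the count.
recent_errors() {
    grep -n -E 'SEVERE|Exception' "$1" | tail -n "${2:-20}"
}

# Example:
#   recent_errors /var/log/tomcat6/catalina.out
```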

Using a standard Tomcat installation and a tomcat user with no login shell, attempts to start Tomcat failed with a "This account is currently not available." message. However, the tomcat user shouldn't have a shell available as it's not a login account. Updating the tomcat6 command in /etc/init.d/tomcat6 to specify the shell when invoking Tomcat fixes this.
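The message above is what a nologin shell prints, so it can help to confirm which shell the tomcat account has before editing the init script. The sketch below reads it from a passwd-format file; the function name is hypothetical, and the su line in the comment only illustrates the kind of change involved, not the exact edit we made.

```shell
# Print the login shell recorded for a user in a passwd-format file
# (the shell is the seventh colon-separated field).
login_shell() {
    awk -F: -v user="$1" '$1 == user { print $7 }' "${2:-/etc/passwd}"
}

# Example:
#   login_shell tomcat
# Running a command as that user with an explicit shell (as root):
#   su -s /bin/sh tomcat -c 'id'
```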

Installing Oracle JDK

The Oracle JDK was downloaded on a desktop machine and uploaded to a Mercurial repository. This repository was cloned on the server and the JDK was installed based on instructions found online. However, using the jdk-...-.rpm.bin file reduced the third step to running the downloaded file.

Again, more disk space was required: 232 MB was needed, so another 250 MB was configured.

Building DSpace

Finally, we get to look at installing DSpace itself. Mainly, this involves following the instructions. The dspace-1.8.1-release.tar.gz archive file was downloaded on a desktop PC and pushed to the Mercurial repository on soundsoftware.ac.uk. This file was then pulled from the repository to our server. On the server, we extracted the archive to a dspace-1.8.1-release folder. Inside that folder is a set of text files (README, LICENSE and NOTICE) and a single subfolder - dspace, the DSpace installation folder. From here on, we'll refer to our dspace-1.8.1-release folder as [dspace-source].

The DSpace configuration file ([dspace-source]/dspace/config/dspace.cfg) needs to be updated to point to the base folder for the DSpace installation. Additionally, server details and the PostgreSQL username and password need to be specified. Lots of other things can be configured – but at the moment we're after a fairly "vanilla" DSpace which we will tweak as the testing progresses.

The DSpace 1.8.1 "binary release" uses a Maven build procedure to install all the dependencies. Moving into the DSpace installation folder, the pom.xml file contains the Maven project definition. In that folder, running mvn package builds the DSpace application, putting the output into a target subfolder.

In our case, the install procedure didn't create the necessary subfolders for the various modules of the application. The following folders needed to be manually created to successfully run the DSpace Maven build:

  • [dspace-source]/dspace/modules/xmlui/src/main/webapp
  • [dspace-source]/dspace/modules/lni/src/main/webapp
  • [dspace-source]/dspace/modules/oai/src/main/webapp
  • [dspace-source]/dspace/modules/jspui/src/main/webapp
  • [dspace-source]/dspace/modules/sword/src/main/webapp
  • [dspace-source]/dspace/modules/swordv2/src/main/webapp
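Rather than creating the six folders by hand, they can be made in one loop. In this sketch the function name is hypothetical and its argument stands for [dspace-source].

```shell
# Create the empty webapp folder for each DSpace module listed above.
# The argument is the [dspace-source] folder.
make_webapp_dirs() {
    for module in xmlui lni oai jspui sword swordv2; do
        mkdir -p "$1/dspace/modules/$module/src/main/webapp" || return 1
    done
}

# Example, run from the folder containing the extracted archive:
#   make_webapp_dirs dspace-1.8.1-release
```

Using mkdir -p means the loop is safe to rerun: folders that already exist are left alone.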

If the DSpace build fails (e.g. through lack of disk space), deleting the target folder allows a clean build to be performed – i.e. in the folder [dspace-source]/dspace the command rm -rf target.

Having used Maven to build DSpace and copy any dependencies, DSpace is then deployed to the release folder from [dspace-source]/dspace/target/dspace-v.v.v-build using ant:

ant fresh_install

If the ant fresh_install fails, dropping the PostgreSQL database and deleting any installed files will allow it to be retried. For this reason, it is sensible to make sure that the folder in which DSpace will be installed is reserved just for DSpace (rather than also containing the build folders). In our case, we installed DSpace in /usr/dspace/system from another subfolder of /usr/dspace.

Having again run out of disk space, we could clear out the aborted install using:

rm -rf /usr/dspace/system
su postgres
dropdb dspace

Then, recreating the dspace database allows the install to be rerun.

createdb -U dspace -W -h localhost -E UNICODE dspace
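If the install has to be retried more than once, the clean-up steps can be wrapped in a script. The following is a sketch only: the function names and the DRY_RUN switch are our own additions, and su postgres -c is used because a bare su postgres would open an interactive shell rather than run the next line of a script.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Remove a failed install and recreate the empty database.
# Arguments: install folder, database name. Intended to be run as root.
reset_dspace_install() {
    # Refuse to run with a missing argument, to avoid an unintended rm -rf.
    [ -n "$1" ] && [ -n "$2" ] || return 1
    run rm -rf "$1" &&
    run su postgres -c "dropdb $2" &&
    run createdb -U dspace -W -h localhost -E UNICODE "$2"
}

# Example (preview first, then run for real):
#   DRY_RUN=1 reset_dspace_install /usr/dspace/system dspace
#   reset_dspace_install /usr/dspace/system dspace
```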

Rather than setting Tomcat to use the dspace user, we gave the tomcat user ownership of the DSpace folders – as DSpace is the only application running under Tomcat on the server this provides the equivalent result with minimal configuration.

chown -R tomcat:tomcat dspace

Additionally, the tomcat user is not a login user, which provides a greater degree of security. Admin users on the server were added to the tomcat group to allow use of the DSpace command-line.

usermod -a -G tomcat user
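To confirm that the change has been recorded, the members of the group can be listed from the group file. This sketch uses a hypothetical function name and only shows users added as supplementary members, which is what usermod -a -G does. Note that a user who is already logged in will not pick up the new group until their next login.

```shell
# Print the users listed as members of a group in a group-format file
# (fields are name:password:gid:comma-separated members), one per line.
group_members() {
    awk -F: -v group="$1" '$1 == group { print $4 }' "${2:-/etc/group}" | tr ',' '\n'
}

# Example:
#   group_members tomcat
```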

One issue raised by looking at this is whether the dspace user really needs write access to so much of the dspace system.

Conclusions

Installing DSpace on the server wasn't as straightforward as it could have been. However, a couple of days' work and some hunting around the internet has resulted in a working DSpace installation.

Some 794 MB were required to build the binary release of DSpace, with 297 MB occupied by the target DSpace built using Maven. Installing the target DSpace via ant resulted in an additional 321 MB for the final system. Maven and the JDK also required large amounts of free space, resulting in approximately 3 GB of free space being needed in /usr to install DSpace.
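Given how often we ran out of space, checking the free space up front would have saved some time. A sketch of such a check is below; the function names are hypothetical, and the 3072 MB threshold in the usage comment is simply the approximate 3 GB figure above expressed in megabytes.

```shell
# Print the free space, in whole megabytes, on the filesystem holding a path.
# df -P gives one line per filesystem; -k reports sizes in kilobytes.
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed if at least the requested number of megabytes is free.
has_free_mb() {
    [ "$(free_mb "$1")" -ge "$2" ]
}

# Example:
#   has_free_mb /usr 3072 || echo "less than 3 GB free in /usr"
```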

Next we will be examining how DSpace can be incorporated into research workflows at C4DM, and looking at how to integrate DSpace with the college's standard authentication mechanisms.

Coming soon: What Does DSpace Offer; Securing DSpace; DSpace Authentication; DSpace Customisation.