DSpace: metadata schemas and data submission

During the past month we have been busy customising DSpace to suit the publication needs of the Centre for Digital Music. In this post, I will talk about our approach to dataset descriptive metadata and the online submission process.

The most commonly used definition of metadata is "data about data". This definition refers to what is known as "descriptive metadata": information that describes the content of other data, for example the authors, date of publication, license, versions, and so on. A large number of so-called metadata schemas (i.e. lists of predefined attributes) have been proposed and used to describe different kinds of data. The main challenge with metadata schemas is to find the right balance between generality and interoperability (using the schema for as many different types of data as possible) on the one hand, and level of detail on the other. More detail helps, among other things, when searching large collections, but it makes the schemas very discipline- or type-specific. Another aspect to keep in mind when choosing a metadata schema is the time it takes to collect and fill in the values of all the attributes. Ideally, as much metadata as possible should be collected together with the data, so that it can easily be found and attached at publication time, but this rarely happens. Having to fill in too many fields can discourage people from publishing at all.

Metadata is an integral part of the process of submitting data to a repository. The online submission process in DSpace is divided into three parts: first, one is prompted to fill in a number of descriptive metadata fields, some of which are compulsory; then, files can be attached to the submission; and finally, a license for the published material is chosen and attached to the data. Out of the box, DSpace comes with a predefined metadata schema based on the Dublin Core Metadata Initiative that is very specific to publications. Its fields start with the prefix dc (e.g. dc.contributor.author, dc.title). It is nevertheless possible to define new schemas and to create customised metadata-entry pages in which fields from different schemas/standards are mixed.

Data produced and used by researchers in the digital music area are very heterogeneous and change relatively rapidly. As a consequence, using very specific metadata schemas is impractical, as it would require creating new customised schemas and metadata-entry pages for each type of dataset. It would also complicate the choice of metadata fields to be indexed by the repository's search engine. We thus decided to use the DataCite metadata schema. This schema is very generic and was specifically defined to describe datasets for citation and retrieval purposes. Although the Dublin Core schema is more widely used, its basic set of attributes is much more publication-oriented. The DataCite schema includes 5 mandatory properties (Identifier, Creator, Title, Publisher, and Publication Year) and 12 optional properties (Subject, Contributor, Date, Language, Resource Type, Alternate Identifier, Related Identifier, Size, Format, Version, Rights, Description). It also specifies a citation schema based on the mandatory properties, of the form: Creator (PublicationYear): Title. Publisher. Identifier. This interesting article describes the advantages of using DataCite for data citation.
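
To make the citation form concrete, here is a minimal Python sketch that builds it from the five mandatory properties. The DataCiteRecord class, the format_citation function and the sample values are made up for illustration and are not part of DataCite or DSpace. Joining several creators with "; " is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCiteRecord:
    """Hypothetical container for the five mandatory DataCite properties."""
    identifier: str
    creators: List[str]
    title: str
    publisher: str
    publication_year: int


def format_citation(record: DataCiteRecord) -> str:
    """Render 'Creator (PublicationYear): Title. Publisher. Identifier'."""
    # Assumption: multiple creators are separated by "; ".
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.publication_year}): "
        f"{record.title}. {record.publisher}. {record.identifier}"
    )


# Placeholder values only; the identifier is not a real DOI.
example = DataCiteRecord(
    identifier="doi:10.0000/example",
    creators=["Doe, J."],
    title="Example Dataset",
    publisher="Example Publisher",
    publication_year=2012,
)
print(format_citation(example))
```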

We therefore added the DataCite metadata schema to the metadata registry (datacite.identifier, datacite.creator, ...) in our DSpace installation, and created a metadata-entry page based on it. It all worked out well, except that DSpace uses hardcoded Dublin Core metadata fields to define the titles and authors shown in lists of items, so items appeared in those lists without this information (although the DataCite metadata was correctly stored and indexed for search). To avoid modifying the core of DSpace, we decided to recreate the metadata-entry page using the Dublin Core attributes that correspond to the DataCite ones. This approach also has the advantage of making the metadata easier to share, since the Dublin Core standard is more widely used. With this change, DSpace now correctly displays all the basic information. Only the following attributes were used, each converted to the Dublin Core attribute in parentheses:

  • Identifier (dc.identifier.uri)
  • Title (dc.title)
  • Creator (dc.contributor.author)
  • Publisher (dc.publisher)
  • Publication Year (dc.date.created, where only the year is mandatory)
  • Subject (dc.subject)
  • Description (dc.description.abstract)
  • Resource type (dc.type)
  • Related Identifier, including related papers and references (dc.relation.(qualifier), where qualifier={isreferencedby;requires;isbasedon;ispartof;isreplacedby;replaces})
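
The mapping above can be written down as a small lookup table. The following Python sketch is only an illustration of the correspondence: the function to_dublin_core and the DataCite key spellings (e.g. "PublicationYear") are our own naming for this example, not a DSpace API.

```python
from typing import Dict, Optional

# DataCite property -> Dublin Core field, as listed above.
DATACITE_TO_DC = {
    "Identifier": "dc.identifier.uri",
    "Title": "dc.title",
    "Creator": "dc.contributor.author",
    "Publisher": "dc.publisher",
    "PublicationYear": "dc.date.created",
    "Subject": "dc.subject",
    "Description": "dc.description.abstract",
    "ResourceType": "dc.type",
}

# Qualifiers accepted for Related Identifier (dc.relation.<qualifier>).
RELATION_QUALIFIERS = {
    "isreferencedby", "requires", "isbasedon",
    "ispartof", "isreplacedby", "replaces",
}


def to_dublin_core(
    datacite: Dict[str, str],
    relations: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Translate a DataCite-keyed record into Dublin Core field names."""
    dc = {}
    for prop, value in datacite.items():
        if prop not in DATACITE_TO_DC:
            raise KeyError(f"no Dublin Core mapping for {prop!r}")
        dc[DATACITE_TO_DC[prop]] = value
    # Related identifiers are keyed by qualifier rather than by property name.
    for qualifier, value in (relations or {}).items():
        if qualifier not in RELATION_QUALIFIERS:
            raise ValueError(f"unknown relation qualifier {qualifier!r}")
        dc[f"dc.relation.{qualifier}"] = value
    return dc
```

Properties outside the list are rejected rather than silently dropped, which mirrors the decision to use only this subset.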

As previously mentioned, DataCite also defines a standard citation scheme, which can easily be generated from the information above. Nevertheless, many prefer to specify their own citation, so we added an extra field using dc.identifier.citation. If it is left blank, the system generates a citation in the standard DataCite format.
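
The fallback logic can be sketched in a few lines of Python. This is not the code running in our DSpace installation, only an illustration of the rule. The function name citation_for is hypothetical, the item is assumed to be a plain dictionary keyed by Dublin Core field names, and dc.date.created is assumed to start with a four-digit year.

```python
from typing import Dict


def citation_for(item: Dict[str, str]) -> str:
    """Return the user-supplied citation, or build the DataCite-style default."""
    custom = item.get("dc.identifier.citation", "").strip()
    if custom:
        return custom
    # Assumption: the date starts with the year (e.g. "2012" or "2012-05-01").
    year = item["dc.date.created"][:4]
    return (
        f"{item['dc.contributor.author']} ({year}): {item['dc.title']}. "
        f"{item['dc.publisher']}. {item['dc.identifier.uri']}"
    )
```

A field containing only whitespace is treated as blank, so the default citation is used in that case as well.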