Representation Information for Crystallography Data, JISC eCrystals Federation Project, WP4: Repositories, Preservation and Sustainability

Manjula Patel

    Research output: Book/Report (Commissioned report)

    Abstract

    The eCrystals Federation project is concerned with setting up a federation of institutional repositories for the management and dissemination of the raw, derived and results data from crystallographic experiments [1]. It builds on the work of the eBank-UK project [2], which developed and implemented the eCrystals repository [3], focusing on the workflows of the laboratory-based experimental technique of chemical crystallography undertaken at the EPSRC National Crystallography Service (NCS) based in Southampton. Once a crystal structure determination has been completed, the data is uploaded into eCrystals and supplemented with chemical and bibliographic metadata. A subsequent scoping study, undertaken as part of phase 3 of the eBank-UK project, identified several issues pertinent to the curation and preservation of crystallography data [4]. Amongst them was the importance of the concepts underlying the OAIS Reference Model [5] and its associated notion of Representation Information (RI).
    Consequently, this report investigates RI for crystallography data and its role in the curation, maintenance and management of such data. We begin with a brief overview of aspects of the OAIS Reference Model [5], which establishes a conceptual framework of terms and components for use in the preservation of information. The Model also identifies the environment within which an OAIS operates, as well as its basic functions in the form of functional, information and information flow models. The notions of a Designated Community (DC) and its associated Knowledge Base (KB), as well as RI, are also defined. RI is any information required to render, process, interpret, use and understand data; for example, it may be a technical specification, a data dictionary or a software tool. To preserve digitally encoded information over the long term, the OAIS Model requires that information remain accessible, understandable and usable by a specified DC. A DC is a group of users or consumers for whom the data is being maintained. Within the OAIS Model, intelligibility of the data by the DC is of paramount importance and RI is a key concept in achieving this [13]. The Model identifies three main types of RI: structural, semantic and other.
    In section 3, we describe the development of a registry/repository of RI (RRoRI) [18], which aims to make relevant RI readily accessible to third parties. The work is heavily based on the ideas in the OAIS Model; it centres on the notion that RI is critical to long-term access to digital information [19], [20]. The current implementation of RRoRI is based on the use of standards (ebXML) and freely available registry/repository software (freebXML) with its associated JAXR interfaces. As explained in section 3.2, access to RI by third parties is enabled through the use of two key concepts: Curation Persistent Identifiers (CPIDs) and descriptive RI labels [24].
    The crystallography domain and the workflow of the NCS are then examined in order to identify significant RI. Procedures at the NCS show a number of well-defined, sequential stages. At each stage, an instrument or computational process produces an output, saved as one or more data files, which provides input to the next stage. The output files vary in format, ranging from images to highly structured data expressed in textual form. We have found that the Crystallographic Information File (CIF) format is central to working with contemporary crystallography data as well as to maintaining access to its information content in the future. CIF is used as a publishing format; as well as being structured and machine-readable, it is capable of describing the whole experiment and modelling process.
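    To illustrate what "structured and machine-readable" means for CIF, the following is a minimal sketch in Python, not a conforming CIF parser and not part of the report. It assumes only the simplest case: a `data_` block header followed by single-line `_tag value` pairs. Loops and semicolon-delimited multi-line text fields are skipped rather than parsed. The function name `parse_simple_cif` and the sample values are hypothetical.

```python
def parse_simple_cif(text):
    """Collect single-line `_tag value` pairs from CIF text, per data block.

    Deliberately incomplete (illustrative assumption): loop_ constructs and
    semicolon-delimited multi-line text fields are skipped, not parsed.
    """
    blocks = {}
    current = None
    in_text_field = False
    for raw in text.splitlines():
        # A semicolon in column one opens or closes a multi-line text field.
        if raw.startswith(";"):
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.lower().startswith("data_"):
            # Start of a new data block; the block name follows the prefix.
            current = blocks.setdefault(line[5:], {})
        elif line.startswith("_") and current is not None:
            parts = line.split(None, 1)
            if len(parts) == 2:
                value = parts[1].strip()
                # Strip one pair of matching surrounding quotes, if present.
                if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                    value = value[1:-1]
                current[parts[0]] = value
            # A bare tag (e.g. a loop_ column header) is skipped.
        # Anything else (loop_ keyword, loop data rows) is skipped.
    return blocks


# Hypothetical sample data, made up for illustration.
SAMPLE = """\
data_example
_cell_length_a    10.234(2)
_cell_length_b    11.456(3)
_symmetry_space_group_name_H-M  'P 21/c'
loop_
_atom_site_label
_atom_site_fract_x
C1 0.1234
O1 0.5678
"""

blocks = parse_simple_cif(SAMPLE)
```

    Here `blocks["example"]["_cell_length_a"]` holds the string `10.234(2)`. Knowing that this is a unit cell length with a standard uncertainty in parentheses requires the data dictionary, which is the kind of semantic RI the report is concerned with.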
    As a result, section 6 clarifies the relationship between the various types of CIF RI: structural (file format specification), semantic (data dictionaries) and other (software). These relationships are then used to develop an RI Network for the CIF format. Section 7 goes on to describe an ingest tool which allows RI to be input into RRoRI, and describes the crystallography RI that has been submitted to RRoRI so far. A simple use case scenario, in section 8, describes how the RI stored in RRoRI may be used to gain access to the information content of a CIF instance by someone unfamiliar with that file format.
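    One way to picture an RI Network is as a small graph of RI labels, each identified by a CPID and linking to further structural, semantic and other RI. The sketch below is a hedged illustration only: it does not use the RRoRI, ebXML or JAXR interfaces, and the class `RILabel`, the function `collect_ri`, the `cpid:example:...` identifiers and the particular links between entries are all hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class RILabel:
    """Hypothetical RI label: a description plus links to further RI by CPID."""
    cpid: str
    description: str
    structure: list = field(default_factory=list)  # e.g. format specifications
    semantic: list = field(default_factory=list)   # e.g. data dictionaries
    other: list = field(default_factory=list)      # e.g. software


def collect_ri(registry, entry_cpid):
    """Return every CPID reachable from entry_cpid, breadth-first, each once."""
    seen = []
    queue = [entry_cpid]
    while queue:
        cpid = queue.pop(0)
        if cpid in seen or cpid not in registry:
            continue  # already visited, or not held in this registry
        seen.append(cpid)
        label = registry[cpid]
        queue.extend(label.structure + label.semantic + label.other)
    return seen


# Made-up identifiers and links, for illustration only.
registry = {
    label.cpid: label
    for label in [
        RILabel(
            "cpid:example:cif-label",
            "RI label for CIF instance files",
            structure=["cpid:example:cif-spec"],
            semantic=["cpid:example:cif-core-dict"],
            other=["cpid:example:cif-software"],
        ),
        RILabel("cpid:example:cif-spec", "CIF file format specification"),
        RILabel(
            "cpid:example:cif-core-dict",
            "CIF core data dictionary",
            # Assumption: the dictionary is itself a CIF-formatted file.
            structure=["cpid:example:cif-spec"],
        ),
        RILabel("cpid:example:cif-software", "Software that reads CIF"),
    ]
}

needed = collect_ri(registry, "cpid:example:cif-label")
```

    Starting from the single CPID of the RI label, `needed` lists the label itself followed by the specification, the dictionary and the software entry, with the shared specification visited only once.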
    We conclude with a discussion of the role of RI in curating and maintaining access to crystallography data and pointers to further work:
    - The range and quantity of RI required for even a simple collection of data is potentially enormous. It is therefore practical to develop a collaborative and shared approach to the problem. It would benefit the whole community if service providers and developers of work-up software (e.g. SHELXS, SHELXL, XPREP) were to provide and maintain comprehensive descriptions of their file formats. We also recommend that crystallographic instrumentation software export raw data in the draft standard imgCIF/CBF (Crystallographic Binary Format) [36].
    - Explicit recording of relevant RI in a central and managed registry/repository such as RRoRI helps ensure that the CIF file format can be understood well into the future by those working across different disciplines, as well as providing intelligible long-term access for crystallographers.
    - In order to associate an RI Network with the CIF files stored in the eCrystals repository, it would be necessary to record a CPID in the metadata record for each CIF instance file. This CPID would act as a point of entry into RRoRI by pointing to an RI label.
    - It is likely that RI in itself may not be sufficient to guarantee effective access and reuse of digital data in the future; additional metadata such as reference, provenance, context and fixity information will also need to be recorded and maintained.
    - RI will itself need to be curated and maintained to provide trusted, authoritative and secure RI that allows users to rely on its authenticity and integrity; this could perhaps be overseen by the DCC.
    - Long term curation of the contents of a registry/repository of RI would have to be guaranteed through adequate sustainability and succession planning, perhaps with an organisation of guaranteed longevity such as the NARA, The National Archives or The British Library.
    - An alternative to relying on a generic, central registry/repository is for the crystallography discipline to develop its own RI registry/repository maintained by the community or a body such as the IUCr. Such a registry/repository would form part of a global and distributed network of RI. The web pages currently maintained by the IUCr, while certainly providing up-to-date information on the CIF file format, are at present suitable only for human access. A registry/repository modelled on the RRoRI would cater for automated machine processing.
    - Furthermore, we can envisage that registries/repositories of RI would have a valuable role to play in the shared services infrastructure part of the JISC Information Environment, helping to provide convenient access to data for both research and learning.
    Original language: English
    Publisher: National Crystallography Service, University of Southampton
    Number of pages: 22
    Publication status: Published - 19 May 2009

    Bibliographical note

    Report originally published on eCrystals Federation Project wiki

    Keywords

    • crystallography data
    • digital curation
    • preservation

    ASJC Scopus subject areas

    • General Chemistry
    • General Computer Science
