File Formats & Versioning

Choosing the right file formats for long-term preservation

If researchers want to preserve research data in the long term, they should improve the technical availability of their data by converting them to durable formats that are expected to remain accessible 5, 10 or 20 years from now. For simple data matrices, text files fulfil this requirement (if they are accompanied by a codebook that makes the data interpretable). Since such formats are not available for all kinds of data, some repositories ensure that data are migrated to new formats when the old format becomes outdated.
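
As a minimal sketch of the text-file approach described above, the following Python example writes a small data matrix to a CSV file and stores the codebook in a second CSV file. The function name, file names, variables and codebook layout are hypothetical and chosen for illustration only; a real codebook would typically also document value labels and missing-value codes.

```python
import csv


def write_matrix_with_codebook(rows, codebook, data_path, codebook_path):
    """Write a data matrix as CSV and its codebook as a second CSV file.

    rows     -- list of dicts, one dict per case (hypothetical layout)
    codebook -- dict mapping each variable name to a description
    """
    variables = list(codebook)  # column order follows the codebook
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=variables)
        writer.writeheader()
        writer.writerows(rows)
    with open(codebook_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["variable", "description"])
        for name, description in codebook.items():
            writer.writerow([name, description])


if __name__ == "__main__":
    # Invented example data, for illustration only.
    codebook = {
        "id": "Participant identifier",
        "age": "Age in years",
        "cond": "Experimental condition (1 = control, 2 = treatment)",
    }
    rows = [
        {"id": 1, "age": 23, "cond": 1},
        {"id": 2, "age": 31, "cond": 2},
    ]
    write_matrix_with_codebook(rows, codebook, "data.csv", "codebook.csv")
```

Both files are plain UTF-8 text, so they can be opened without the software that produced the original data.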

Beyond these general considerations, there are also recommendations and best practices for specific data types in psychology. For biopsychological data, the following standards can be used:

  • Brain Imaging Data Structure (BIDS) for fMRI data (introductory article)
  • Best Practices in Data Analysis and Sharing in Neuroimaging using MRI for MRI data
  • DICOM (Digital Imaging and Communications in Medicine) standard for storing and transmitting information in medical imaging
  • Recommendations on creating and documenting EEG data by the German Society for Clinical Neurophysiology and Functional Imaging (Deutsche Gesellschaft für Klinische Neurophysiologie und Funktionelle Bildgebung, DGKN), available in German only
  • European Data Format (EDF) for EEG data

For behavioural psychological research data you can rely on the BIDS Standard (Gorgolewski et al., 2016).

Versioning of research data

Versioning, or version control, means saving changes to (data) files and keeping a record of them. Whenever a file changes, a new copy with a new version number should be generated. This makes it possible to return to older versions at any time and to reconstruct the development of a data file. Reconstructing this development is often referred to as data provenance documentation and is an essential property of transparent, reproducible science. Versioning should follow a systematic scheme that specifies, for example, under which circumstances a new version is created.
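
The idea of generating a new numbered copy whenever a file changes can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `save_new_version`, the archive folder and the naming pattern `<name>_v01.<ext>` are hypothetical and not prescribed by any standard.

```python
import re
import shutil
from pathlib import Path


def save_new_version(working_file, archive_dir):
    """Copy working_file into archive_dir under the next free version number."""
    working_file = Path(working_file)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Match earlier copies such as "study_v01.csv" (assumed naming pattern).
    pattern = re.compile(
        re.escape(working_file.stem) + r"_v(\d+)" + re.escape(working_file.suffix)
    )
    existing = [
        int(match.group(1))
        for path in archive_dir.iterdir()
        if (match := pattern.fullmatch(path.name))
    ]
    next_version = max(existing, default=0) + 1

    target = archive_dir / (
        f"{working_file.stem}_v{next_version:02d}{working_file.suffix}"
    )
    shutil.copy2(working_file, target)  # copy2 also keeps the file timestamps
    return target
```

Calling `save_new_version("study.csv", "versions")` after each change leaves every earlier state of the file in the `versions` folder, so older versions stay available and the sequence of copies documents how the file developed.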

How to do versioning?

Depending on the complexity of your research data, you can apply the following procedures for data versioning:

  • Defining Milestones. When a predefined milestone is accomplished (e.g. data file: input of all collected data), a separate milestone version of the file (master-file) is created. For this master-file, copies in different formats (e.g. csv, xml, sav) should be generated and archived. The generation of a checksum can be an additional safety measure, see data integrity.
  • Using Sub-Versions. Sub-versions denote small changes that have been made in one work day, while major versions are milestone versions or particularly important updates. Sub-versions do not need to be saved in various formats with checksums.
  • Documenting dependencies. Provide instructions (e.g. as a readme file) on the changes required in other files as a consequence of changing or updating one file.
  • Setting validation dates. Determine specific dates on which to validate and, if needed, harmonize the data files. Such a date could, for example, fall shortly before a milestone is reached.
  • Adding a change log. Add a change log to each data file that describes the changes made in the latest version.
  • Using collaborative working environments. Specialized software or built-in functions of programs can be very handy when it comes to working collaboratively on documents, managing versions or synchronizing folder content. The most prominent example of version control software is Git, often used through hosting platforms such as GitHub, which is widely used in software development.
  • Creating safety copies regularly. This also includes controlling access to these safety copies.
  • Publication-related data storage. In psychology, primary data should never be altered (i.e., transformed, aggregated, or recoded). When you publish an article, you should be able to publish the raw data along with the syntax files that reproduce your final results (Schönbrodt, Gollwitzer & Abele-Brehm, 2016).
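
Two of the procedures above, generating a checksum for a milestone version and adding a change log entry, can be sketched in Python as follows. This is a minimal sketch, not a prescribed workflow: SHA-256 is an assumed choice of checksum algorithm, and the function names, the `.sha256` file suffix and the `CHANGELOG.txt` file name are hypothetical.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_milestone(master_file, version, note, changelog="CHANGELOG.txt"):
    """Store the checksum next to the master file and append a change log line."""
    master_file = Path(master_file)
    checksum = sha256_of(master_file)
    checksum_file = master_file.with_name(master_file.name + ".sha256")
    checksum_file.write_text(f"{checksum}  {master_file.name}\n", encoding="utf-8")
    log_path = master_file.with_name(changelog)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{date.today().isoformat()}  v{version}  {note}\n")
    return checksum


def is_unchanged(master_file):
    """Check whether the master file still matches its stored checksum."""
    master_file = Path(master_file)
    checksum_file = master_file.with_name(master_file.name + ".sha256")
    stored = checksum_file.read_text(encoding="utf-8").split()[0]
    return stored == sha256_of(master_file)
```

After `record_milestone("study_master.csv", "1.0", "input of all collected data")`, a later call to `is_unchanged("study_master.csv")` returns `False` if the archived master file has been modified or damaged in the meantime.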

References

Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., … & Poldrack, R. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3(1), 1-9.

Further Resources and Tools

File Formats

Recommended formats for other data types, selected for their suitability for long-term preservation, can be retrieved from the UK Data Archive.

The Digital Preservation Handbook, 2nd Edition (https://www.dpconline.org/handbook), provides a comprehensible introduction to this issue.

Versioning

Further information on data versioning is provided by the Australian National Data Service and the UK Data Archive.

The DANS guide on data preparation for data sharing in the social sciences also provides a comprehensive introduction to issues related to data versioning.