Which information should be included in data documentation?

Planning the documentation of research data generally means deciding which metadata, i.e. data that help to give meaning to a data object, need to be recorded for a given data collection process. Guidance on which metadata to create comes from different sources. On the one hand, most data archives have their own set of mandatory metadata that the data depositor has to provide. In most cases these are administrative metadata, i.e. information needed to manage a resource, such as when and how it was created, its file type and other technical information, and access rights. On the other hand, metadata reporting standards, like the Data Documentation Initiative (DDI) standard for the social sciences, can be used to document other relevant structural and descriptive metadata. A sound approach to deciding which metadata to report for an understandable description of research data is to rely on existing standards for data documentation as well as general reporting standards. Below you will find a list of standards that offer guidance on how to document your data, data collection procedures, study design, measurement instruments or interventions. In addition, the codebook, an essential component of data documentation and data sharing, is described in more detail.

Data Documentation Standards

BIDS (Brain Imaging Data Structure; Gorgolewski et al., 2016)
EEG-BIDS (Pernet et al., 2019)
MEG-BIDS (Niso et al., 2018)
iEEG-BIDS (Holdgraf et al., 2019)
aDWI-BIDS (Gholam et al., 2021)
genetics BIDS (Moreau et al., 2020)

Reporting guidelines for (Experimental) Studies

The American Psychological Association’s Journal Article Reporting Standards (JARS) are currently the most relevant standard for reporting psychological studies. They give detailed information on how to describe the methods sections of a paper.
The APA Publication Manual, which is wider in scope than the JARS, can also serve as guidance for your documentation.
The CONSORT statement is probably the guideline with the greatest impact in the health sciences. CONSORT stands for Consolidated Standards of Reporting Trials, and it gives evidence-based recommendations for reporting randomized trials. It comprises a 25-item checklist (on reporting the design, analysis and interpretation of the trial) as well as a flow diagram (showing the flow of all participants through the trial). Various extensions of the CONSORT statement are already available or under development.
Additionally, CONSORT-SPI, an extension for randomized controlled trials of social and psychological interventions, is currently under development.
Study Design Reporting: The SPIRIT statement, which provides checklists on various aspects of clinical trial protocols, may serve as guidance for reporting your study design.
Pre-Registration: van’t Veer and Giner-Sorolla (2016) published recommendations on the pre-registration of studies in social psychology. Additional materials are available on the corresponding OSF project. Moreover, a task force composed of members of the APA, BPS, DGPs, the COS, and ZPID developed a preregistration template, which can be accessed via the PreReg platform at ZPID.
MRI Data: Guidance by the Organization for Human Brain Mapping (OHBM) Committee on Best Practice in Data Analysis and Sharing (COBIDAS)
EEG Data: Empfehlungen zur Erzeugung und Dokumentation von EEG Daten [Recommendations on the Generation and Documentation of EEG Data] (in German) of the Deutsche Gesellschaft für Klinische Neurophysiologie und Funktionelle Bildgebung. See also EEG-BIDS (Pernet et al., 2019).
Measurement Instruments: The RatSWD – German Data Forum published a working paper on Quality Standards for the Development, Application, and Evaluation of Measurement Instruments in Social Science Survey Research (Rammstedt et al., 2015).
Meta-analyses: JARS also provides guidance on reporting meta-analyses.
Qualitative Data: A JARS adaptation for qualitative data is also available.

Codebooks

Codebooks are an essential component of data documentation and data sharing in the social sciences. In general, they describe the contents, structure, and layout of a data collection. Following the ICPSR’s information on codebooks (n.d.), the core components of a codebook are:

Variable name. The name of a variable should consist only of letters, digits and underscores. Note that programs differ in the name length they allow, the symbols they support, and whether they distinguish between upper and lower case letters. You should provide a ReadMe on the naming conventions that were used. An example of an elaborate naming convention is that of the GESIS Panel: its assignment rules ensure that every variable name is unique, easily identifiable and meets archive standards (a maximum length of 8 characters (digits or letters) and no mixing of upper and lower case letters).
Variable label. A short description or the full name of a variable. For example, if the variable name is BDI_Q1_T1, the full name could be Beck Depression Inventory, Question 1, Baseline.
Variable type. There is no fixed scheme for describing the variable type. At a minimum, you should distinguish between (a) numeric variables (e.g. 5-point rating scale, height, intelligence), (b) strings (any open text item) and (c) dates.
Valid values. The set of valid values used to code categories for nominal and ordinal categorical variables. For continuous variables, the range of valid values should be defined (e.g. by assigning value labels to the minimum and maximum). To make clear that no value label was left out by accident, we recommend assigning value labels to all valid values that are listed.
Value labels. Value labels provide information on how to interpret valid values for nominal and ordinal categorical variables, as well as information on how to interpret missing values for all types of variables.
Missing values. The set of values used to code missing data. “Blanks” or “sysmis” values should not be used as missing values because they make it impossible to discriminate between fields that were deliberately left blank (items that were not answered or are missing by design) and fields that were simply skipped during data entry. Different kinds of missing values should be distinguished, e.g. missing by design (e.g. because some questions were only asked in the control group), not applicable (e.g. pregnancy for male participants), and not answered. You should therefore assign a different code to each kind of missing value and, subsequently, a value label to each code. It is important to standardize missing values, i.e. to use one code for each kind of missing value consistently throughout your dataset. In some cases, it may be useful to define a range of missing values. A missing value range (e.g. 6-99 for a 5-point Likert scale) makes it easier to exclude wild codes (e.g. 55 instead of 5 because of a typing error) or measurement errors (e.g. heart rates higher than 220 beats per minute) from analyses.
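
The core components listed above can be sketched as a small data structure. The following Python example is only an illustration and is not taken from the ICPSR guidance: the class name CodebookEntry, the variable ITEM1_T1, the name pattern, and the missing value codes 97, 98 and 99 are assumptions made for this sketch. It uses a hypothetical 5-point Likert item with the missing value range 6-99 mentioned above.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumed naming convention: letters, digits and underscores, starting with a letter.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


@dataclass
class CodebookEntry:
    """One codebook row (hypothetical structure for illustration)."""
    name: str
    label: str
    var_type: str                      # "numeric", "string" or "date"
    value_labels: Dict[int, str]       # labels for all valid values
    missing_labels: Dict[int, str]     # labels for each kind of missing value
    missing_range: Optional[Tuple[int, int]] = None  # inclusive (low, high)

    def name_is_valid(self) -> bool:
        """Check the variable name against the assumed naming convention."""
        return NAME_PATTERN.fullmatch(self.name) is not None

    def classify(self, value: int) -> str:
        """Describe how a raw value should be interpreted."""
        if value in self.value_labels:
            return "valid: " + self.value_labels[value]
        if value in self.missing_labels:
            return "missing: " + self.missing_labels[value]
        if self.missing_range is not None:
            low, high = self.missing_range
            if low <= value <= high:
                # Inside the missing range but without a label, e.g. a typo.
                return "missing: wild code"
        return "undocumented"


item = CodebookEntry(
    name="ITEM1_T1",
    label="Hypothetical questionnaire, Item 1, Baseline",
    var_type="numeric",
    value_labels={
        1: "strongly disagree",
        2: "disagree",
        3: "neither agree nor disagree",
        4: "agree",
        5: "strongly agree",
    },
    missing_labels={
        97: "missing by design",
        98: "not applicable",
        99: "not answered",
    },
    missing_range=(6, 99),
)

for value in (5, 99, 55, 0):
    print(value, "->", item.classify(value))
```

In this sketch, 55 falls inside the missing value range without having a label and is therefore treated as a wild code, while 0 lies outside both the valid values and the missing range and is flagged as undocumented.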

Extended Information

The following information should be included in either the variable label or a separate attribute field if it enhances data intelligibility:

Variable item text/instruction. The exact wording of the questionnaire item, software instruction, etc. corresponding to the variable (in consideration of third party rights).
Measurement occasion. The measurement occasion for the variable (e.g. wave 1, pre-treatment).
Instrument. The measurement instrument to which the variable belongs.
Construct. The theoretical construct that is measured by a variable.
Unit of measurement. The unit of measurement for continuous variables (e.g. meter, seconds).
Response unit. The entity that provided information.
Analysis unit. The unit that is analysed in the variable. Note that response unit and analysis unit are not necessarily the same (e.g. parents providing information on their child’s behaviour).
Filter variable. Is this variable a filter variable? Depending on participants’ responses on a filter variable, a set of subsequent items/questions will be presented or not. For example, the variable “marital status” is a filter variable if a set of questions is only presented to subjects who stated that they were married.
Imputation. If any kind of imputation took place this should be documented.
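
Documenting filter variables also makes it possible to check a dataset for consistency. The following Python sketch is a minimal illustration under stated assumptions: the variable names marital_status and marital_satisfaction, the code 2 for "married", and the code 97 for "missing by design" are hypothetical and not prescribed by any standard. It flags records in which a follow-up item contradicts the filter variable.

```python
MARRIED = 2             # assumed code for "married"
MISSING_BY_DESIGN = 97  # assumed code for "missing by design"


def filter_violations(records, filter_var, filter_value, dependent_var):
    """Return the indices of records whose follow-up item contradicts the filter.

    A follow-up item should be coded "missing by design" exactly when the
    respondent did not meet the filter condition.
    """
    violations = []
    for index, record in enumerate(records):
        passed_filter = record[filter_var] == filter_value
        skipped = record[dependent_var] == MISSING_BY_DESIGN
        # Inconsistent: item presented but coded as skipped by design,
        # or item answered although it should not have been presented.
        if passed_filter == skipped:
            violations.append(index)
    return violations


records = [
    {"marital_status": 2, "marital_satisfaction": 4},   # consistent
    {"marital_status": 1, "marital_satisfaction": 97},  # consistent
    {"marital_status": 1, "marital_satisfaction": 3},   # answered despite filter
    {"marital_status": 2, "marital_satisfaction": 97},  # skipped despite filter
]

violations = filter_violations(
    records, "marital_status", MARRIED, "marital_satisfaction"
)
print(violations)
```

A check like this only works if the filter relationship and the missing value codes are recorded in the codebook, which is one reason to document them explicitly.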

References

Gholam, J., Szczepankiewicz, F., Tax, C. M., Mueller, L., Kopanoglu, E., Nilsson, M., … & Beltrachini, L. (2021). aDWI-BIDS: an extension to the brain imaging data structure for advanced diffusion weighted imaging. arXiv preprint arXiv:2103.14485.

Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., … & Poldrack, R. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific data, 3(1), 1-9.

Holdgraf, C., Appelhoff, S., Bickel, S., Bouchard, K., D’Ambrosio, S., David, O., … & Hermes, D. (2019). iEEG-BIDS, extending the Brain Imaging Data Structure specification to human intracranial electrophysiology. Scientific data, 6(1), 1-6.

Moreau, C. A., Jean-Louis, M., Blair, R., Markiewicz, C. J., Turner, J. A., Calhoun, V. D., … & Pernet, C. R. (2020). The genetics-BIDS extension: Easing the search for genetic data associated with human brain imaging. GigaScience, 9(10), giaa104.

Niso, G., Gorgolewski, K. J., Bock, E., Brooks, T. L., Flandin, G., Gramfort, A., … & Baillet, S. (2018). MEG-BIDS, the brain imaging data structure extended to magnetoencephalography. Scientific data, 5(1), 1-5.

Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., & Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific data, 6(1), 1-5.

Rammstedt, B., Beierlein, C., Brähler, E., Eid, M., Hartig, J., Kersting, M., … & Weichselgartner, E. (2015). Quality Standards for the Development, Application, and Evaluation of Measurement Instruments in Social Science Survey Research. RatSWD Working Paper Series, No. 245.

Further Resources and Tools

Arslan, R. C. (2019). How to automatically document data with the codebook package to facilitate data re-use. Advances in Methods and Practices in Psychological Science, 2(2), 169–187. https://doi.org/10.1177/2515245919838783

Other guidelines can be retrieved from the EQUATOR Network, which maintains a directory of reporting guidelines for the health sciences with more than 280 entries.

FAIRsharing.org curates information on inter-related data standards, databases and policies (in life, environmental and biomedical sciences).

The RDA Metadata Standards Directory Working Group is composed of individuals and organizations that are involved in developing, implementing and using metadata for scientific data.

Information on the codebook is based on the ICPSR website on codebooks and the corresponding information in the PsychData Manual.