Data Integrity & Data Selection

Data Integrity

The term data integrity is used in different contexts: it may refer to the consistency of the information recorded within a digital data object or to the consistency of the digital data object itself.

Data Integrity and Data Cleaning (consistency of information represented by data)

Common flaws that can impair the quality and consistency of the information recorded within a dataset are wild codes (e.g. three different values occurring for the variable sex although only two categories are defined), values that are out of range (e.g. the value 9 for an item with a range from 1 to 5), and inconsistent (illogical) or implausible values. Data cleaning comprises all measures taken to ensure data integrity by detecting, correcting or preventing the flaws mentioned above. Data cleaning procedures should be outlined beforehand.
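Checks for wild codes and out-of-range values can be automated. The following is a minimal sketch in Python, assuming a small hypothetical dataset in which sex is coded 1 or 2 and item1 is a rating from 1 to 5. The codebook, the variable names and the function find_invalid_values are illustrative assumptions and do not belong to any specific tool.

```python
# Hypothetical codebook: the set of valid codes for each variable.
CODEBOOK = {
    "sex": {1, 2},               # assumed coding with two valid categories
    "item1": set(range(1, 6)),   # rating scale from 1 to 5
}


def find_invalid_values(records, codebook):
    """Return (row index, variable, value) for every value not in the codebook.

    A missing variable is reported with the value None.
    """
    problems = []
    for index, record in enumerate(records):
        for variable, valid_codes in codebook.items():
            value = record.get(variable)
            if value not in valid_codes:
                problems.append((index, variable, value))
    return problems


records = [
    {"sex": 1, "item1": 4},
    {"sex": 3, "item1": 2},   # wild code: 3 is not a defined category
    {"sex": 2, "item1": 9},   # out of range: 9 on a 1-to-5 scale
]

for index, variable, value in find_invalid_values(records, CODEBOOK):
    print(f"row {index}: invalid value {value!r} for variable {variable!r}")
```

Inconsistent or implausible values usually need rules that relate several variables to each other, so they cannot be found with a simple codebook lookup like this one.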

Data Integrity and Checksums (consistency of the data file itself)

Data integrity may also refer to the consistency of the data file itself, meaning that no changes to the file occurred accidentally or due to transmission errors. Checksums can be used to create fingerprints of digital objects and to verify data integrity, because a dataset’s checksum changes if the dataset is modified. Thus, accidental changes or changes due to software or hardware faults become detectable. You should therefore use checksums when generating copies of master files, creating back-up copies or downloading files from repositories, to verify that the copy and the original file are identical. Examples of checksum algorithms are the SHA family (e.g. SHA-256) and MD5. Checksum generators are freely available as web applications or freeware.
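Checksums can also be computed with a few lines of code. The sketch below uses Python's standard hashlib module to compare a master file with a back-up copy; the file names and the helper sha256_checksum are made up for illustration, and SHA-256 is just one possible algorithm.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_checksum(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a temporary master file and a back-up copy.
with tempfile.TemporaryDirectory() as folder:
    master = Path(folder) / "master.csv"
    backup = Path(folder) / "backup.csv"
    master.write_text("id,item1\n1,4\n2,5\n")
    shutil.copyfile(master, backup)

    # Identical files yield identical checksums.
    copy_is_identical = sha256_checksum(master) == sha256_checksum(backup)

    # Simulate an accidental modification of the back-up copy.
    backup.write_text("id,item1\n1,4\n2,3\n")
    change_is_detected = sha256_checksum(master) != sha256_checksum(backup)

print("copy identical:", copy_is_identical)
print("change detected:", change_is_detected)
```

In practice, the checksum of the master file would be stored alongside the file or in its documentation, so that copies can be verified later without access to the original.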

Data Selection

The term data selection refers to choosing which data should be stored during data collection or shared/archived after the project is completed. In some contexts, data selection also refers to the process by which a data archive chooses datasets that are considered worth preserving in the long term (e.g. the selection criteria of the UK Data Archive). This aspect will not be considered here.

Data Selection Decisions during a Research Project

During data collection, researchers have to define under which circumstances collected data should be stored or discarded. Typically, the principal investigator defines the criteria for this purpose, but other bodies (e.g. the research institution) may also be responsible for defining them. As data selection procedures affect the resulting research data, they need to be thoroughly documented. Examples of data that may be considered irrelevant (and can be discarded) are data based on incomplete runs or flawed codes. Additionally, unless specific consent to keep personal data was obtained, such data should be deleted as soon as possible in order to comply with legal requirements (see the knowledge base’s section on data privacy).

Although data can, naturally, only be shared after data collection is finished, it is important to consider your data sharing plans before data collection starts. For example, if you obtained explicit consent to share personal data only for a subset of subjects, you will have to prepare different workflows for storing and anonymizing data.
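One way to picture such separate workflows is sketched below in Python. The record layout, the field names (consent_to_share, run_complete, name, email) and the helper prepare_for_sharing are hypothetical and only illustrate the idea. Note that removing direct identifiers, as done here, is not by itself sufficient to anonymize a dataset.

```python
# Hypothetical names of fields that contain personal data.
PERSONAL_FIELDS = ("name", "email")


def prepare_for_sharing(records):
    """Build the shareable dataset from the collected records.

    - Records from incomplete runs are discarded (an example selection criterion).
    - Personal fields are kept only for subjects who consented to sharing them.
    """
    shareable = []
    for record in records:
        if not record["run_complete"]:
            continue  # discard according to the documented selection criterion
        cleaned = dict(record)  # copy, so the stored original stays unchanged
        if not record["consent_to_share"]:
            for field in PERSONAL_FIELDS:
                cleaned.pop(field, None)
        shareable.append(cleaned)
    return shareable


records = [
    {"id": 1, "name": "A", "email": "a@example.org", "score": 12,
     "run_complete": True, "consent_to_share": True},
    {"id": 2, "name": "B", "email": "b@example.org", "score": 17,
     "run_complete": True, "consent_to_share": False},
    {"id": 3, "name": "C", "email": "c@example.org", "score": 5,
     "run_complete": False, "consent_to_share": True},
]

for record in prepare_for_sharing(records):
    print(record)
```

Whatever criteria are applied in such a step, they should be documented together with the data, since they determine what the shared dataset contains.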

Practical Guideline

Heiko Tjalsma and Jeroen Rombouts created practical guidelines for appraising and selecting research data, on which the following information is based (see also the webpage of Research Data Netherlands for a more condensed checklist on this topic). The following points should be considered in the selection process.

Selection criteria

primary vs. secondary data
- primary data are data in their original, unedited form (often these are also the raw data that have not yet been changed by the researcher). It is not (yet) common to publish primary data, but they are needed for verification purposes, e.g. when it is necessary to reconstruct the analyses performed.
- data become secondary data when researchers process or change the primary data (e.g. transform values, create specific values, etc.). These are often the data that are shared with others.
who makes the selection decision?
- institute: The data policy of a research institute may contain information regarding the goals, resources and legal obligations and may also offer information about which data to select for preservation/sharing.
- data repository: The data repository often has its own collection criteria, which specify which research data are preserved and under which conditions.
- community: The members of the community who are interested in the data can also influence the data selection process. Important factors at this level of data selection concern standardisation, legal or cultural aspects, and specific properties of the data, such as open and permanent access and data format.
technical aspects
- which data formats, software and hardware are used?
metadata
- are the metadata sufficient and available? Which information do they contain (e.g. technical information, codebooks, information on the data structure and on intellectual property rights)?
which infrastructure is available to preserve the data?
- data archive
- institutional or thematic repository
- other?
costs of data selection
- how will the costs of selecting, converting and preserving the data, and of making them available, be covered?

Further Resources & Tools

Data Integrity

Chapman, A. D. (2005). Principles and Methods of Data Cleaning – Primary Species and Species-Occurrence Data, version 1.0. Copenhagen: Report for the Global Biodiversity Information Facility.

UK Data Archive. Managing and Sharing Data. Best Practice for Researchers. Retrieved from: https://ukdataservice.ac.uk/learning-hub/research-data-management/store-your-data/backup/

Data Selection

DCC (2014). Five steps to decide what data to keep: a checklist for appraising research data v.1. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides

Gollwitzer, M., Abele-Brehm, A., Fiebach, C. J., Ramthun, R., Scheel, A., Schönbrodt, F. & Steinberg, U. (2021). Data Management and Data Sharing in Psychological Science: Revision of the DGPs Recommendations. doi: https://doi.org/10.31234/osf.io/24ncs

Tjalsma, H., & Rombouts, J. (2010). Selection of Research Data; Guidelines for appraising and selecting research data. Retrieved from https://repository.tudelft.nl/islandora/object/uuid%3Adbab8a19-542a-4c4d-96b4-df8cc39333db