Codebooks are an essential component of data documentation and data sharing in the social sciences. In general, they describe the contents, structure, and layout of a data collection.
Core Components
- Variable name. The name of a variable should consist only of letters, digits, and underscores. Note that programs differ in the allowed length, in the symbols they support, and in whether they distinguish between upper- and lower-case letters. You should provide a ReadMe on the naming conventions that were used. An example of an elaborate naming convention is that of the GESIS Panel: its assignment rules ensure that every variable name is unique, easily identifiable, and meets archive standards (a maximum length of 8 characters (digits or letters) and no mixing of upper- and lower-case letters).
- Variable label. A short description or the full name of a variable. For example, if the variable name was BDI_Q1_T1, the full name could be Beck Depression Inventory, Question 1, Baseline.
- Variable type. There is no fixed scheme for describing the variable type. At a minimum, you should distinguish between (a) numeric variables (e.g. 5-point rating scale, height, intelligence), (b) strings (any open text item), and (c) dates.
- Valid values. The set of valid values used to code categories for nominal and ordinal categorical variables. For continuous variables, the range of valid values should be defined (e.g. by assigning value labels to the minimum and maximum). To show that no value label was omitted by accident, we recommend assigning value labels to all valid values that are listed.
- Value labels. Value labels provide information on how to interpret valid values for nominal and ordinal categorical variables, as well as information on how to interpret missing values for all types of variables.
- Missing values. The set of values used to code missing data. “Blanks” or “sysmis” values should not be used as missing values because they make it impossible to distinguish fields that were deliberately left blank (items that were not answered or are missing by design) from fields that were simply skipped during data entry. Different kinds of missing values should be distinguished, e.g. missing by design (e.g. because some questions were only asked in the control group), not applicable (e.g. pregnancy for male participants), and not answered. You should therefore assign a different code to each kind of missing value and then assign a value label to each code. It is important to standardize missing values, i.e. to use exactly one code for each kind of missing value consistently throughout your dataset. In some cases, it may be useful to define a range of missing values. Such a range (e.g. 6-99 for a 5-point Likert scale) makes it easier to exclude wild codes (e.g. 55 instead of 5 because of a typing error) or measurement errors (e.g. heart rates above 220 beats per minute) from analyses.
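The core components above can be sketched as a small data structure. The following Python example is only an illustration under stated assumptions: the naming rule (lower-case letters, digits, and underscores, at most 8 characters), the class `CodebookEntry`, the item `sat_q1`, and the missing value codes 97, 98, and 99 are all hypothetical and are not taken from any particular archive standard.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical naming rule: starts with a lower-case letter, then lower-case
# letters, digits, or underscores; at most 8 characters in total.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]{0,7}")


def is_valid_name(name: str) -> bool:
    """Return True if the name satisfies the hypothetical naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


@dataclass
class CodebookEntry:
    """One variable in a codebook (illustrative structure, not a standard)."""
    name: str
    label: str
    var_type: str                       # "numeric", "string", or "date"
    value_labels: Dict[int, str]        # labels for all valid values
    missing_labels: Dict[int, str] = field(default_factory=dict)
    missing_range: Optional[Tuple[int, int]] = None  # inclusive bounds

    def classify(self, value: int) -> str:
        """Classify a raw value as valid, missing, or undocumented."""
        if value in self.value_labels:
            return "valid"
        # Labelled missing codes take precedence over the generic range.
        if value in self.missing_labels:
            return "missing: " + self.missing_labels[value]
        if self.missing_range is not None:
            low, high = self.missing_range
            if low <= value <= high:
                return "missing: wild code"
        return "undocumented"


# Hypothetical 5-point item with one code per kind of missing value.
item = CodebookEntry(
    name="sat_q1",
    label="Satisfaction scale, question 1 (hypothetical item)",
    var_type="numeric",
    value_labels={
        1: "strongly disagree",
        2: "disagree",
        3: "neither",
        4: "agree",
        5: "strongly agree",
    },
    missing_labels={
        97: "not answered",
        98: "not applicable",
        99: "missing by design",
    },
    missing_range=(6, 99),
)

for raw in (5, 55, 98, 0):
    print(raw, "->", item.classify(raw))
```

In this sketch a typing error such as 55 falls into the missing value range and is excluded as a wild code, whereas a value outside both the valid set and the range (here 0) is flagged as undocumented so that it can be checked against the raw data.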
Extended Information
The following information should be included either in the variable label or in a separate attribute field if it enhances data intelligibility:
- Variable item text/instruction. The exact wording of the questionnaire item, software instruction, etc. corresponding to the variable (in consideration of third party rights).
- Measurement occasion. The measurement occasion for the variable (e.g. wave 1, pre-treatment).
- Instrument. The measurement instrument to which the variable belongs.
- Construct. The theoretical construct that is measured by a variable.
- Unit of measurement. The unit of measurement for continuous variables (e.g. meter, seconds).
- Response unit. The entity that provided information.
- Analysis unit. The unit that is analysed in the variable. Note that response unit and analysis unit are not necessarily the same (e.g. parents providing information on their child’s behaviour).
- Filter variable. Is this variable a filter variable? Depending on participants’ responses to a filter variable, a set of subsequent items/questions will be presented or not. For example, the variable “marital status” is a filter variable if a set of questions is only presented to subjects who stated that they were married.
- Imputation. If any kind of imputation took place, this should be documented.
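Documenting filter variables also makes it possible to check a dataset for consistency. The following Python sketch assumes a hypothetical filter variable `marital` (1 = married, 2 = not married), a hypothetical follow-up item `spouse_age`, and a hypothetical "not applicable" code of 98; none of these names or codes come from a real study.

```python
NOT_APPLICABLE = 98  # hypothetical "not applicable" code
MARRIED = 1          # hypothetical code for "married"


def filter_violations(records, filter_var, passes, follow_up, na_code):
    """Return indices of records whose follow-up value contradicts the filter.

    `passes` is a function that returns True if the filter value means the
    follow-up question was presented to the participant.
    """
    bad = []
    for i, rec in enumerate(records):
        asked = passes(rec[filter_var])
        is_na = rec[follow_up] == na_code
        # Participants who were asked must not carry the "not applicable"
        # code; participants who were skipped must carry it.
        if asked == is_na:
            bad.append(i)
    return bad


records = [
    {"marital": 1, "spouse_age": 41},  # asked and answered: consistent
    {"marital": 2, "spouse_age": 98},  # skipped and coded n/a: consistent
    {"marital": 2, "spouse_age": 37},  # skipped but has a value: violation
    {"marital": 1, "spouse_age": 98},  # asked but coded n/a: violation
]

violations = filter_violations(
    records,
    filter_var="marital",
    passes=lambda value: value == MARRIED,
    follow_up="spouse_age",
    na_code=NOT_APPLICABLE,
)
print(violations)
```

Whether an asked participant may legitimately carry the "not applicable" code depends on the study design, so the rule used here is an assumption that should be adapted to the documented filter logic.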
Further Resources and Tools
Arslan, R. C. (2019). How to automatically document data with the codebook package to facilitate data re-use. Advances in Methods and Practices in Psychological Science, 2(2), 169–187. https://doi.org/10.1177/2515245919838783
Other guidelines can be retrieved from the EQUATOR Network, which maintains a directory of guidelines for the health sciences with more than 280 entries.
FAIRsharing.org curates information on inter-related data standards, databases and policies (in life, environmental and biomedical sciences).
The RDA Metadata Standards Directory Working Group is composed of individuals and organizations that are involved in developing, implementing and using metadata for scientific data.
The information on codebooks given here is based on the ICPSR website on codebooks and the corresponding information in the PsychData Manual.