Data Description Document Guidelines

The purpose of the Data Description Document is to provide future researchers (including yourself) with the information needed to understand and use the archived data. This file is required for submission of materials to the ScholarsArchive data repository and is the primary file associated with a data archive.

The Short Version

If you do not want to read all the details (although that is recommended), here’s the executive summary:

The file should be a text document with the filename DataDescription-<DatasetName>.
At a minimum, it should contain a list of every file in the data set and a brief description of the data included in each file.
Ideally, there should be a detailed description of the data in each file, including the software used, the data units (e.g. tabular columns, data records) and their meanings, and any caveats related to the use of the data.
The better the data description document, the easier the data will be to use and the more likely it will be used and cited.

Document Format

The document should be a simple text document. Although it can be written in a word processor and saved in that format (e.g. .odt or .docx), it should ideally be saved as a plain text file (.txt) for maximum compatibility with future software. Alternatively, PDF files are accepted if the formatting of the file contents is critical and cannot be achieved in a plain text file. The file name should be DataDescription-<DatasetName> with the file extension appropriate to its format (.txt, .odt, .docx, etc.). <DatasetName> should be the title of the dataset (or part of the title if it is long); it distinguishes the data description document from the others in the repository.
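
As an illustration of the naming convention, the following Python sketch builds a file name from a dataset title. The function name description_filename, the choice to join capitalized words, and the 60-character cutoff for long titles are assumptions made for this example; the repository does not prescribe them.

```python
import re


def description_filename(dataset_title: str, extension: str = ".txt",
                         max_length: int = 60) -> str:
    """Build a DataDescription-<DatasetName> file name from a dataset title.

    Hypothetical helper: the word-joining style and max_length cutoff are
    illustrative choices, not repository requirements.
    """
    # Keep only runs of letters and digits, dropping spaces and punctuation.
    words = re.findall(r"[A-Za-z0-9]+", dataset_title)
    # Capitalize the first character of each word and join them.
    name = "".join(word[:1].upper() + word[1:] for word in words)
    # Use only part of the title if it is long.
    name = name[:max_length]
    # Accept the extension with or without a leading dot.
    if not extension.startswith("."):
        extension = "." + extension
    return f"DataDescription-{name}{extension}"


print(description_filename("Willamette river temperature 2019"))
```

Any scheme that keeps the DataDescription- prefix and an recognizable form of the dataset title would serve the same purpose.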

Document Content

The goal of the Data Description Document is to record all information about the data files and their contents so that someone can use the data in a future research project and understand the data’s content and structure. For a short video on why documenting data content is important, watch https://youtu.be/N2zK3sAtr-4.

The target audience for this file is a future researcher who is at least passingly familiar with the field of study, but possibly not an expert. Jargon and technical terms are okay, but their meaning should be unambiguous. If a term could have different meanings, either avoid it or define precisely what it means in the context of the data (e.g. the term “phase data” in a physics data set could refer to the parts of a cycle or the states of matter). The clearer the descriptions, the easier the data will be to understand and work with, and the more likely that researchers will use and cite the data. When in doubt, define the terms for maximum clarity.

Minimum Requirements

At a bare minimum, the document should contain a list of each file in the dataset, a short description of each file’s contents, and the software used to create the data. If there are any obvious (or egregious) issues, quirks, or “gotchas” with using the data, these should be noted as well. This is the “if I went away for a summer and then started using the data again, what would I want to remember” level of documentation.
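
One way to get started on the minimum listing is to generate a skeleton automatically and then fill in the descriptions by hand. The Python sketch below walks a dataset folder and writes one entry per file. The function name draft_file_listing, the layout of each entry, and the TODO placeholders are assumptions made for this example, not a required format.

```python
from pathlib import Path


def draft_file_listing(dataset_dir: str) -> str:
    """Return a plain-text skeleton listing every file under dataset_dir.

    Hypothetical helper: the entry layout is illustrative only. The TODO
    fields must be completed by someone who knows the data.
    """
    root = Path(dataset_dir)
    lines = ["Files in this dataset", ""]
    # Visit every regular file, including those in subfolders, in sorted order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        size = path.stat().st_size
        lines.append(f"{relative} ({size} bytes)")
        lines.append("    Description: TODO")
        lines.append("    Software used: TODO")
        lines.append("")
    lines.append("Known issues and caveats: TODO")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a skeleton for the current folder.
    print(draft_file_listing("."))
```

The generated text only guarantees that no file is left out of the list; the descriptions, software, and caveats still have to be written by hand.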

In a Perfect World…

The document should contain everything someone completely new to the data would need to know to access and interpret the data and make sense of its contents. This would include such things as a detailed description of the type of data in the file, the software used to create the data (down to the exact version number and where to get it), and any specific settings and parameters used in that software, whether or not they are embedded in the file data itself.

For tabular data, there should be a description of each column (or row depending on how it is organized) listing the column name, the type of data in the column, any special formatting, minimum and maximum data ranges and units when appropriate, as well as any other information needed to understand the data in the column. Where this is self-documented by the file format, it could be omitted but would still be useful to someone reading the document and deciding if they should download the data or not.
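
For tabular data stored as CSV, a first draft of the column descriptions can be produced by script. The Python sketch below reports each column’s name, a rough type, and the minimum and maximum of numeric columns. The function name summarize_columns, the numeric/text/empty classification, and the assumption that the file has a header row and UTF-8 encoding are choices made for this example. Units and meanings cannot be inferred from the file and are left as TODO.

```python
import csv


def summarize_columns(csv_path: str) -> list[dict]:
    """Summarize each column of a CSV file that has a header row.

    Hypothetical helper: a column counts as "numeric" only if every
    non-empty value parses as a float; otherwise it is reported as "text".
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = {name: [] for name in (reader.fieldnames or [])}
        for row in reader:
            for name in columns:
                value = row.get(name)
                # Skip missing and empty cells.
                if value not in (None, ""):
                    columns[name].append(value)

    summary = []
    for name, values in columns.items():
        entry = {"column": name, "type": "text", "min": None, "max": None}
        if not values:
            entry["type"] = "empty"
        else:
            try:
                numbers = [float(v) for v in values]
            except ValueError:
                numbers = None
            if numbers is not None:
                entry.update(type="numeric", min=min(numbers), max=max(numbers))
        # These fields require human knowledge of the data.
        entry["units"] = "TODO"
        entry["meaning"] = "TODO"
        summary.append(entry)
    return summary
```

Calling summarize_columns on a data file and pasting the result into the document gives a starting table; special formatting, missing-value codes, and other caveats still need to be added by hand.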

For other types of data, there should be descriptions of the smaller data blocks within the file. These may include information such as how, when, and where the data was collected, or any other information that would allow a potential user to evaluate the usefulness of the data and understand its contents when using it.

Generally, this level of documentation includes even information that is self-documented in the file formats, so that someone does not have to get the files and open them to understand the contents. It’s the “I want to give this to a new undergraduate researcher and not have them bugging me constantly for explanations” level of documentation.

In Reality

Producing ideal documentation is difficult and time-consuming, and we always want to get on with the next thing. Hitting the perfect documentation mark would be great, but getting at least to the “I want to give this to my new graduate student and not have them asking me data format questions all the time” level is probably more realistic.

This is somewhere between the minimum and ideal levels and will vary based on the data, its formats, and the discipline. Basically, the Data Description Document should include descriptions of each file and any information on the format and usability that is not self-documented in the data itself. It should also list any caveats, irregularities, or other issues that a user may encounter and any special considerations that the user needs to be aware of.

When in doubt, ask a colleague or graduate student to read over your data description to see whether they understand it well enough to take your data and begin a new research study without excessive explanation from you. In the end, that is what you are striving for.