
Composition

In general, the questions herein should be studied before data collection. The datasheets paper1 notes that most of the questions:

"… are intended to provide dataset consumers with the information they need to make informed decisions about using the dataset for their chosen tasks. … the questions are designed to elicit information about compliance with the EU's General Data Protection Regulation (GDPR) or comparable regulations in other jurisdictions."



What does each data set instance represent?

Describe what each instance, i.e., row, of the data set represents. Although unusual, a data set can be multi-representational, e.g., an instance in a merchant's database table might represent one of: (a) an online reading event, (b) an online product purchasing event, or (c) an online download event. If so, describe each representation.



Number of Instances

How many instances does the data set have?



Pre-processed?

Are any aspects of the data set pre-processed? If yes:

  • Document the pre-processing steps.
  • State whether the underlying raw data is available and, if so, provide a link to the data.
  • If available, provide a link to the pre-processing programs.



Is the data set a sample of a larger data set?

If yes:

  • If the data set is representative of the larger data set: How was representativeness verified/validated?
  • If the data set is not representative of the larger data set, e.g., is a geographically focused subset, explain why.
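
One way to probe representativeness is to compare the distribution of a categorical field in the sample with its distribution in the larger data set. The sketch below is only an illustration under stated assumptions: the function names, the `region` values, and the choice of total variation distance as the measure are hypothetical and not prescribed by the datasheets paper.

```python
from collections import Counter


def proportions(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(sample, population):
    """Half the sum of absolute differences between category proportions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    p, q = proportions(sample), proportions(population)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical field values: the region recorded for each instance.
population = ["north"] * 50 + ["south"] * 30 + ["east"] * 20
representative_sample = ["north"] * 5 + ["south"] * 3 + ["east"] * 2
geographic_subset = ["north"] * 10

print(total_variation_distance(representative_sample, population))
print(total_variation_distance(geographic_subset, population))
```

A distance near zero supports a claim of representativeness for that field only. A large distance, as for the geographically focused subset, is the kind of result the explanation above should account for.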



Lineage

Summarise the data set's lineage, including linkage options.2, 3



Licences & Fees

If applicable, summarise the data set's licence terms and costs.



Profiles of Instances

Herein, the focus is a summary of the instances of a data set, e.g., for a tabular data set:

By Field

  • The field name.
  • Description: What does the element of an instance denote/represent?
  • Data type.
  • Dictionary of a categorical data type.
  • Unit of measure.
  • Is this a raw data field or a feature?
  • Is this a target field?
  • Does the field identify a sub-population?
  • Column Profile: Column profiling "… provides statistical information regarding the distribution of data values and associated patterns that are assigned to each data attribute, …".5 If a field/column has missing elements, explain why.
  • A graph of the field's data distribution.
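
As a rough illustration of a column profile, the sketch below summarises a single field with the Python standard library. The function name `profile_column`, the use of `None` to mark a missing element, and the `order_value_eur` field are assumptions made for the example.

```python
import statistics


def is_number(value):
    """True for int or float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def profile_column(values):
    """Summarise one field; None stands for a missing element."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Numeric summaries only apply when every present value is a number.
    if present and all(is_number(v) for v in present):
        profile.update(
            minimum=min(present),
            maximum=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    return profile


# Hypothetical field: order value in euros, with one missing element.
order_value_eur = [12.5, 7.0, None, 30.0, 7.0]
print(profile_column(order_value_eur))
```

The `missing` count only reports how many elements are absent. The reason they are absent still has to be documented by hand.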


Across Fields

  • Cross-Column Profiles: Relationships between columns.
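
One common kind of cross-column relationship is a dependency, where the value of one column is expected to determine the value of another. The sketch below checks such an expectation. The function name, the `postcode` and `city` fields, and the rows are hypothetical.

```python
def violations_of_dependency(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[determinant], set()).add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical rows: each postcode is expected to belong to exactly one city.
rows = [
    {"postcode": "EH1", "city": "Edinburgh"},
    {"postcode": "G1", "city": "Glasgow"},
    {"postcode": "EH1", "city": "Edinburg"},  # inconsistent spelling
]
print(violations_of_dependency(rows, "postcode", "city"))
```

An empty result is consistent with the dependency holding in this data set. It does not prove that the dependency holds in general.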



Errors

Please detail any errors, sources of noise, or redundancies.
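
Exact duplicate instances are one redundancy that is easy to detect mechanically. The sketch below is a minimal illustration: the function name and the example instances are assumptions, and near-duplicates or noisy values need other checks.

```python
from collections import Counter


def duplicated_rows(rows):
    """Return each row that occurs more than once, with its occurrence count."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical instances: (customer id, event type, date).
rows = [
    ("c01", "purchase", "2024-01-05"),
    ("c02", "download", "2024-01-05"),
    ("c01", "purchase", "2024-01-05"),
]
print(duplicated_rows(rows))
```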



Are there recommended data splits?
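
If a split is recommended, it helps when consumers can reproduce it exactly. One possible approach, sketched below, derives the split from a hash of a stable instance identifier. The function name `assign_split`, the identifier format, and the 20% test fraction are assumptions for illustration, not a recommendation from the datasheets paper.

```python
import hashlib


def assign_split(instance_id, test_fraction=0.2):
    """Map an instance identifier to 'train' or 'test' reproducibly."""
    digest = hashlib.sha256(instance_id.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_fraction else "train"


# Hypothetical identifiers; the same identifier always lands in the same split.
ids = [f"instance-{i}" for i in range(1000)]
splits = {i: assign_split(i) for i in ids}
print(sum(1 for s in splits.values() if s == "test"))
```

Because the assignment depends only on the identifier, adding new instances later does not move existing instances between splits.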



Confidentiality

Does the data set contain data that might be considered confidential? For example,

  • Is the data protected by legal privilege or by doctor–patient confidentiality?
  • Does the data include the content of private/non-public communications of individuals?



Identification of Individuals

Is it possible to identify individuals directly or indirectly?
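
Indirect identification often arises from combinations of fields that are harmless on their own. The sketch below counts how many instances share each combination of such fields, in the spirit of a k-anonymity check. The function name, the choice of quasi-identifier fields, and the rows are hypothetical.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the rarest combination of quasi-identifier values.

    A result of 1 means at least one instance is unique on these fields.
    """
    groups = Counter(tuple(row[f] for f in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical instances without names, but with indirect identifiers.
rows = [
    {"postcode": "EH1", "birth_year": 1980, "sex": "F"},
    {"postcode": "EH1", "birth_year": 1980, "sex": "F"},
    {"postcode": "G1", "birth_year": 1975, "sex": "M"},
]
print(smallest_group_size(rows, ["postcode", "birth_year", "sex"]))
```

A result above 1 does not rule out re-identification, since linkage with other data sets can still single out individuals. A result of 1 is a clear signal that indirect identification is possible.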



Data Sensitivity

Does the data set include sensitive data elements? Describe. Examples of sensitive data elements are elements that directly/indirectly reveal:

  • Locations.
  • Financial details.
  • Health details.
  • Biometric profiles.
  • Genetic profiles.
  • Government identification codes of individuals.
  • Criminal history.
  • Institutionally and/or commercially sensitive data.
  • Race or ethnic origin.
  • Sexual orientations.
  • Religious beliefs.
  • Political opinions.
  • Trade union memberships.
  • And more.



Distressing Data Elements

Does "… the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?"4, 1












  1. Datasheets for Datasets, arXiv:1803.09010v8, 2021, updated datasheet appendix 

  2. QLIK: What is data lineage? 

  3. IBM: What is data lineage? 

  4. Datasheets for Datasets, Communications of the ACM, 2021, Volume 64, Issue 12, pages 86 – 92 

  5. 5.5.2 Profiling for Data Quality Assessment, in Master Data Management, Page 96, The MK/OMG Press, 2008 

  6. What is data profiling?