Data & Datasheets

The datasheet of a dataset outlines the dataset's provenance, lineage, and related context. Each dataset must have its own datasheet. Gebru and colleagues first proposed Datasheets for Datasets, for machine learning products and projects, in 2018/2019. Following feedback from a variety of institutions, industries, and agencies, a baseline Datasheets for Datasets outline was released.¹,²

Each datasheet consists of a set of questions. For the dataset creator, the datasheet's primary objective is to:

… encourage dataset creators to reflect carefully upon (a) the "process of creating, distributing, and maintaining a dataset", and (b) "any underlying assumptions, potential risks or harms, and implications of use".

For the dataset user, the primary objective is to:

… ensure that the dataset user has the information required to "… make informed decisions about using a dataset."

This chapter is divided into sections. Except for the Natural Language Processing section, the sections reflect the groupings of the questions in the latest datasheet version.¹,²,³ The questions of the natural language processing (NLP) section, developed by Bender & Friedman,⁴ apply to NLP projects only.
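As a rough illustration of how a datasheet's grouped questions might be held in code, the sketch below stores answers per section and records inapplicable questions explicitly instead of leaving them blank. The section names follow the question groupings in Gebru et al.; the `Datasheet` class, its method names, and the `NOT_APPLICABLE` marker are hypothetical and not part of any published tooling.

```python
from dataclasses import dataclass, field

# Question groupings as named in Gebru et al., "Datasheets for Datasets".
SECTIONS = (
    "Motivation",
    "Composition",
    "Collection Process",
    "Preprocessing/Cleaning/Labeling",
    "Uses",
    "Distribution",
    "Maintenance",
)

# Hypothetical marker: an inapplicable question is noted, never left blank.
NOT_APPLICABLE = "Not applicable"


@dataclass
class Datasheet:
    """Illustrative container for one dataset's datasheet (assumed design)."""

    dataset_name: str
    # Maps section name -> {question text -> answer text}.
    answers: dict = field(default_factory=dict)

    def record(self, section, question, answer=None):
        """Store an answer; a missing or empty answer is noted as inapplicable."""
        if section not in SECTIONS:
            raise ValueError(f"Unknown section: {section!r}")
        text = answer.strip() if answer else ""
        self.answers.setdefault(section, {})[question] = text or NOT_APPLICABLE

    def unanswered_sections(self):
        """Return the sections that have no recorded questions yet."""
        return [s for s in SECTIONS if s not in self.answers]


sheet = Datasheet("demo-dataset")
sheet.record("Motivation", "For what purpose was the dataset created?", "Benchmarking")
sheet.record("Composition", "Does the dataset relate to people?")
print(sheet.unanswered_sections())
```

An NLP project could extend `SECTIONS` with an additional entry for the Bender & Friedman questions; that extension is likewise an assumption about how one might organise the material.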

¹ Gebru et al., Datasheets for Datasets, Communications of the ACM, 2021, Volume 64, Issue 12, pages 86–92
² Gebru et al., Datasheets for Datasets, arXiv:1803.09010v8, 2021, updated datasheet appendix
³ If a question is inapplicable, note down its inapplicability.
⁴ Bender & Friedman, Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science, Transactions of the Association for Computational Linguistics, 2018, 6: 587–604