Natural Language Processing

Background

This section applies only to data sets for natural language processing (NLP). The focus is the set of data statements proposed in Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science by E. Bender & B. Friedman.

Please study the description of each statement in Section 5 of the paper. Then, for each data set, record the applicable data statements. Brief excerpts follow.


Data Statement Queries in Brief

Curation Rationale

Why was the data collected? In brief, which "… texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection?" The answers affect which inferences the data can support.


Language Variety

This is important because languages "… differ from each other in structural ways that can interact with NLP algorithms".
Additionally, and within a language, "… regional or social dialects can also show great variation".


Speaker Demographic

This is important in cases where "… variation (in pronunciation, prosody, word choice, and grammar) correlates with speaker-demographic characteristics".


Annotator Demographic

In brief, what are the "… demographic characteristics of the annotators and [the] annotation guideline developers?" These questions are critical because the answers give an insight into "… their experience with language and thus their perception of what they are annotating".


Speech Situation

This matters mainly because "… characteristics of the speech situation can affect linguistic structure and patterns …".


Text Characteristics

This matters mainly because "… genre and topic influence the vocabulary and structural characteristics of texts".


Recording Quality

If applicable, outline recording quality.


Provenance Appendix

If the data set was derived from an existing data set, "… the data statements for the source datasets should be included as an appendix".


Other Pertinent Details

Record pertinent details that the previous statements do not request.
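
One possible way to record these statements per data set is a small structured record. The sketch below is only an illustration, not a format prescribed by the paper or by this guide: the class name DataStatement, the dataset_name field, the unanswered helper, and the example values are all hypothetical, and the remaining field names simply mirror the headings above.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DataStatement:
    """Hypothetical per-data-set record; fields mirror the headings above."""

    dataset_name: str  # assumed identifier, not one of the data statements
    curation_rationale: Optional[str] = None
    language_variety: Optional[str] = None
    speaker_demographic: Optional[str] = None
    annotator_demographic: Optional[str] = None
    speech_situation: Optional[str] = None
    text_characteristics: Optional[str] = None
    recording_quality: Optional[str] = None  # only if applicable
    # Data statements of any source data sets this one was derived from.
    provenance_appendix: List["DataStatement"] = field(default_factory=list)
    other_pertinent_details: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Return the names of the free-text statements that are still empty."""
        skipped = ("dataset_name", "provenance_appendix")
        return [
            f.name
            for f in fields(self)
            if f.name not in skipped and not getattr(self, f.name)
        ]


# Placeholder values for illustration only.
statement = DataStatement(
    dataset_name="example-corpus",
    curation_rationale="Placeholder: why the texts were collected.",
    language_variety="Placeholder: language and dialect description.",
)
print(statement.unanswered())
```

Listing the unanswered statements makes it easy to see what remains to be recorded; statements that do not apply to a data set, such as recording quality for text-only data, can be marked explicitly rather than left empty.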