Natural Language Processing

Background

This section applies only to data sets for natural language processing (NLP). It focuses on the data statements proposed in Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science by E. Bender & B. Friedman.

Please study the description of each statement within Section 5. Then, for each data set, record the applicable data statements. Brief excerpts follow.

Data Statement Queries in Brief

Curation Rationale

Why was the data collected? In brief, which "… texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection?" The answers affect the inferences that can be drawn from the data.

Language Variety

This is important because languages "… differ from each other in structural ways that can interact with NLP algorithms".
Additionally, within a language, "… regional or social dialects can also show great variation".

Speaker Demographic

This is important where "… variation (in pronunciation, prosody, word choice, and grammar) correlates with speaker-demographic characteristics".

Annotator Demographic

In brief, what are the "… demographic characteristics of the annotators and [the] annotation guideline developers?" These are critical questions because they give an insight into "… their experience with language and thus their perception of what they are annotating".

Speech Situation

This matters mainly because "… characteristics of the speech situation can affect linguistic structure and patterns …".

Text Characteristics

This matters mainly because "… genre and topic influence the vocabulary and structural characteristics of texts".¹

Recording Quality

If applicable, outline recording quality.

Provenance Appendix

If the data set was derived from an existing data set, "… the data statements for the source datasets should be included as an appendix".

Other Pertinent Details

Record pertinent details that the previous statements do not request.
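
One possible way to record these statements per data set is sketched below. It is an illustration only, not a format prescribed by the paper or by this guide: the class name `DataStatement`, its field names, the `unanswered` helper, and the example data set names are all assumptions chosen to mirror the headings above.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class DataStatement:
    """One record per data set; field names mirror the headings above (hypothetical)."""

    data_set: str  # hypothetical identifier for the data set
    curation_rationale: Optional[str] = None
    language_variety: Optional[str] = None
    speaker_demographic: Optional[str] = None
    annotator_demographic: Optional[str] = None
    speech_situation: Optional[str] = None
    text_characteristics: Optional[str] = None
    recording_quality: Optional[str] = None  # only if applicable
    # Data statements of the source data sets, if this one is derived.
    provenance_appendix: List["DataStatement"] = field(default_factory=list)
    other_pertinent_details: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Return the names of the free-text statements that are still empty."""
        skip = ("data_set", "provenance_appendix")
        return [
            f.name
            for f in fields(self)
            if f.name not in skip and not getattr(self, f.name)
        ]


# Placeholder values for illustration; they do not describe a real data set.
source = DataStatement(
    data_set="source-corpus",
    curation_rationale="placeholder rationale for the source collection",
)
derived = DataStatement(
    data_set="derived-corpus",
    curation_rationale="placeholder rationale for the sub-selection",
    language_variety="placeholder language variety",
    provenance_appendix=[source],
)

# Lists the statements that still need to be written for the derived data set.
print(derived.unanswered())
```

Keeping the source data set's record inside `provenance_appendix` follows the Provenance Appendix guidance above; the `unanswered` helper is simply a convenience for spotting statements that have not been recorded yet.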

¹ The text Dimensions of Register Variation: A Cross-Linguistic Comparison might be helpful.