Agreements ‘in the wild’: Standards and alignment in machine learning benchmark dataset construction. Big Data & Society, 11(2), pp. 1–13. doi: 10.1177/20539517241242457
How do AI scientists create data? This ethnographic case study offers an empirical exploration of the making of a computer vision dataset intended by its creators to capture everyday human activities from first-person perspectives, purportedly "in the wild". While this insider term suggests that the data is direct and untouched, the article reveals the extensive work that goes into shaping and curating it.
The dataset is shown to be the outcome of deliberate curation and alignment work that demands careful coordination. Protocols such as data annotation guidelines, intended to standardize metadata inputs, help ensure consistency in the technical work and contribute to making the data usable for machine learning development.
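As a purely illustrative aside, the kind of consistency that annotation guidelines enforce can be pictured as a lightweight validation step applied to each metadata record. The sketch below assumes hypothetical field names and a stand-in activity vocabulary; none of these details come from the article's dataset.

```python
# Hypothetical sketch: checking one annotation record against a guideline-style schema.
# Field names and allowed values are illustrative, not taken from the article's dataset.

REQUIRED_FIELDS = {"video_id", "start_sec", "end_sec", "activity_label", "annotator_id"}
ALLOWED_ACTIVITIES = {"cooking", "cleaning", "shopping", "socializing"}  # stand-in vocabulary

def validate_annotation(record: dict) -> list[str]:
    """Return a list of guideline violations for one annotation record."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if record.get("activity_label") not in ALLOWED_ACTIVITIES:
        errors.append(f"unrecognized activity label: {record.get('activity_label')!r}")
    if record.get("start_sec", 0) >= record.get("end_sec", 0):
        errors.append("start_sec must be earlier than end_sec")
    return errors

if __name__ == "__main__":
    sample = {"video_id": "v001", "start_sec": 12.0, "end_sec": 9.5,
              "activity_label": "gardening", "annotator_id": "a42"}
    print(validate_annotation(sample))  # reports the out-of-vocabulary label and bad timestamps
```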
A deeper intervention concerns the dataset creators' use of a standardized list of daily activities, based on an American government survey, to guide what the dataset should cover and what the data subjects should record. The list did not map frictionlessly onto the lives of data subjects situated in different contexts.
To address this, the dataset creators engaged in alignment work, negotiating with data subjects to ensure their recordings fit the project's needs. A series of negotiations brought the diverse contributions into a shared direction. This careful structuring of activities, the instructions given to participants, and the subsequent annotation of the data all shape the dataset into something that computers can process.
The study uses these findings to prompt a reconsideration of how we view the "wildness" of data, a quality associated with objectivity and detachment in science, and to recognize the layers of human effort that go into creating datasets. Can data be considered "wild" when it is so carefully curated? It is perhaps not as "wild" as it seems: the alignment work required to make the dataset usable for machine learning blurs the line between intervention and detachment. A fuller understanding of scientific and engineering practices thus emerges when we account for the often unseen, labor-intensive work on which coordination and agreement within research teams rely.
The dataset creators moreover developed benchmarks and organized competitions to measure the performance of different models on tasks such as action recognition and 3D scene understanding based on the dataset. Benchmark datasets are crucial for evaluating and comparing machine learning models. Following the Common Task Framework, benchmarks help standardize evaluation across the AI community and enable distributed groups to collaborate. As actors convene to size up different models, benchmarks become a social vehicle that engenders shared meanings about model performance. However, they also reinforce the limitations inherent in the data, as they become the yardstick against which new models are evaluated. This underscores the importance of closely examining how benchmark datasets are constructed in practice, and the qualities attributed to them.
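To make the standardizing role of benchmarks concrete, the sketch below illustrates the Common Task Framework pattern in its most minimal form: every submission is scored against the same held-out labels with the same metric, so results from distributed groups are directly comparable. The label format and the top-1 accuracy metric are assumptions chosen for illustration, not details reported in the article.

```python
# Minimal sketch of Common Task Framework-style scoring: a fixed test set and a fixed
# metric are shared, so submissions from different teams are directly comparable.
# The label format (one class label per example id) is a hypothetical simplification.

def top1_accuracy(gold: dict[str, str], predictions: dict[str, str]) -> float:
    """Fraction of test examples whose predicted label matches the gold label."""
    correct = sum(1 for example_id, label in gold.items()
                  if predictions.get(example_id) == label)
    return correct / len(gold)

if __name__ == "__main__":
    gold = {"clip_01": "cooking", "clip_02": "cleaning", "clip_03": "shopping"}
    team_a = {"clip_01": "cooking", "clip_02": "cooking", "clip_03": "shopping"}
    team_b = {"clip_01": "cooking", "clip_02": "cleaning", "clip_03": "shopping"}
    # Both teams are ranked on the same data with the same metric.
    print("team A:", top1_accuracy(gold, team_a))  # 0.666...
    print("team B:", top1_accuracy(gold, team_b))  # 1.0
```

Because the test labels and the metric are fixed in advance, whatever limitations the underlying data carries are inherited by every model ranked this way, which is the point the article raises about benchmarks becoming the yardstick for new models.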