Saturday 5 October 2024

Guest Blog: Interoperable and Standardized Algorithmic Images: The Domestic War on Drugs and Mugshots Within Facial Recognition Technologies

by Aaron Tucker

Tucker, A. (2024). Interoperable and standardized algorithmic images: The domestic war on drugs and mugshots within facial recognition technologies. Big Data & Society, 11(3). https://doi.org/10.1177/20539517241274593

Generative AI (GenAI) systems, such as Midjourney, Stable Diffusion, and DALL-E, are data visualization systems. Such technologies are the result of their training data in combination with dense algorithmic mathematics: the images they produce surface that original training data, pairings of text and image, for better and worse.

This dynamic is especially problematic when image data that are laced with racialized vectors of power, such as mugshots, are freely available to those building GenAI models. From their development in the late 19th century by scientists such as Francis Galton and Alphonse Bertillon, mugshots were always meant to be mobile and standardized: the accepted visuality of the mugshot, as a front-facing pose and a side profile taken in light designed to maximize visibility, ensured that the photograph could be “accurately” compared to any face in question across a variety of locations.

Such logics were adopted by computer scientists working on the problem of computational face recognition and mugshots. Reports such as the 1997 “Best Practice Recommendation for the Capture of Mugshots” stressed that mugshots needed to be “interoperable” so that they could shift between various facial recognition technologies (FRTs) and applications.

Mugshots are intercut with socio-technical systems such as policing practices, mental health support, and addiction support; mugshots are not neutral images, but rather a composite of affect, lived narrative, social power structures, and, often, violence in many forms. Therefore, as Katherine Biber warned in her 2013 article, “In Crime’s Archive,” there are real dangers when the criminal archive slips uncritically into the cultural sphere. It is crucial that we pay attention to GenAI and the ways that it tells on itself and the data it is visualizing through its creations. As my article describes, the ability to generate images with the prompt of “a mugshot” that are defined by the same biases as mugshot databases is alarming.

The solution is not to ban such prompts or crack down on prompt engineering that surfaces such results, but rather to address the root issue: the uncritical use and re-use of problematic data in machine training, not just in computer vision systems, but in all AI systems. 



Monday 23 September 2024

BD&S 2024 Online Colloquium: POLITICS, POWER AND DATA

This year’s Big Data & Society colloquium centers on the theme of "Politics, Power, and Data," exploring the complex intersections where data, algorithms, and socio-political forces converge. Mark your calendars and be sure to join this exciting set of four talks. Details below.


Sunday 22 September 2024

Guest Blog: By Isak Engdahl

Engdahl, I. (2024). Agreements ‘in the wild’: Standards and alignment in machine learning benchmark dataset construction. Big Data & Society, 11(2), 1–13. https://doi.org/10.1177/20539517241242457

 

How do AI scientists create data? This ethnographic case study provides an empirical exploration into the making of a computer vision dataset intended by its creators to capture everyday human activities from first-person perspectives, arguably "in the wild." While this insider term suggests that the data is direct and untouched, this article reveals the extensive work that goes into shaping and curating it.

 

The dataset is shown to be the outcome of deliberate curation and alignment work, which requires careful coordination. Protocols such as data annotation guidelines, meant to standardize the inputs of metadata, help to ensure consistency in technical work and contribute to making the dataset usable for machine learning development.

 

A deeper intervention concerns the dataset creators’ use of a standardized list of daily activities, based on an American government survey, to guide what the dataset should cover and what the data subjects should record. The list did not map without friction onto the lives of data subjects situated in different contexts.

 

To address this, dataset creators engaged in alignment work—negotiating with data subjects to ensure their recordings fit the project’s needs. A series of negotiations brought the diverse contributions into a shared direction, structuring the activities and the subsequent data. The careful structuring of activities, the instructions given to participants, and the subsequent annotation of the data all contribute to shaping the dataset into something that can be processed by computers.

 

The study uses these results to promote a reconsideration of how we view the "wildness" of data—a quality associated with objectivity and detachment in science—and to recognize the layers of human effort that go into creating datasets. Can we consider data 'wild' if it’s so carefully curated? It is perhaps not as "wild" as it might seem—the alignment work required to make the dataset usable for machine learning does seem to blur the line between intervention and detachment. A fuller understanding of scientific and engineering practices thus emerges when we consider the often unseen, labor-intensive work that coordination and agreement within research teams rely on.

 

The dataset creators moreover developed specific benchmarks and organized competitions to measure the performance of different models on tasks like action recognition and 3D scene understanding based on the dataset. Benchmark datasets are crucial for evaluating and comparing the performance of machine learning models. Benchmarks, following the Common Task Framework, help standardize the evaluation process across the AI community: they enable distributed groups to collaborate. As actors convene to size up different models, benchmarks become a social vehicle that engenders shared meanings on model performance. However, they also reinforce the limitations inherent in the data, as they become the yardstick by which new models are evaluated. This underscores the importance of closely examining how benchmark datasets are constructed in practice, and the qualities they are attributed.
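For readers less familiar with how a benchmark operates, the idea of a common task can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code of the dataset discussed in the article: the clip descriptions, the activity labels, the two toy "models," and the function names `accuracy` and `leaderboard` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shared test set: (clip description, reference activity label).
# In a real benchmark these would be video clips and curated annotations.
BENCHMARK: List[Tuple[str, str]] = [
    ("hands rinsing a plate", "washing dishes"),
    ("knife slicing an onion", "cooking"),
    ("hands folding a shirt", "doing laundry"),
    ("spoon stirring a pot", "cooking"),
]

# A "model" here is any function that maps a clip description to a label.
Model = Callable[[str], str]


def accuracy(model: Model, benchmark: List[Tuple[str, str]]) -> float:
    """Share of benchmark items where the model matches the reference label."""
    correct = sum(model(clip) == label for clip, label in benchmark)
    return correct / len(benchmark)


def leaderboard(
    models: Dict[str, Model], benchmark: List[Tuple[str, str]]
) -> List[Tuple[str, float]]:
    """Score every model on the same data with the same metric, best first."""
    scores = [(name, accuracy(model, benchmark)) for name, model in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def always_cooking(clip: str) -> str:
    """Toy baseline that ignores its input (illustrative only)."""
    return "cooking"


def keyword_rules(clip: str) -> str:
    """Toy rule-based labeller (illustrative only)."""
    if "plate" in clip:
        return "washing dishes"
    if "shirt" in clip:
        return "doing laundry"
    return "cooking"


if __name__ == "__main__":
    entries = {"always_cooking": always_cooking, "keyword_rules": keyword_rules}
    for name, score in leaderboard(entries, BENCHMARK):
        print(f"{name}: {score:.2f}")
```

Note that the ranking is only meaningful relative to the labels in `BENCHMARK`: if the activity categories fit some participants' lives poorly, every score computed against them inherits that limitation.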

 

 

Saturday 14 September 2024

Guest Blog: The 'doings' behind the data

by Isabelle Donatz-Fest

Donatz-Fest, I. (2024). The ‘doings’ behind data: An ethnography of police data construction. Big Data & Society, 11(3). https://doi.org/10.1177/20539517241270695

Data are the lifeblood of algorithmic systems. But data are often taken for granted by public organizations that see them as something just lying around, ready to use. Such is the case with police reports, which are increasingly used as data for algorithmic applications for policing worldwide.

But there are ‘doings’ behind data. Data are created in unexpected places, like the front seat of a speeding police car or the desk of an overworked detective. Material factors and human actors interact behind the scenes, informing data creation and interpretation.

I spent roughly 200 hours conducting ethnographic observations of how street-level employees at the Netherlands Police translate events into police reports. What I found was that data work is deeply embedded in policing, shaped by personal values, organizational context, and practical considerations.

Structured data often clash with the officers' understanding of a situation. Registration software demands that incidents be fitted into predefined categories, but the messy world that we live in rarely fits neatly into such boxes. Unstructured data provide more flexibility and richness but introduce complexities for standardization and (algorithmic) interpretation. Open text fields open the door to linguistic nuances, inconsistencies, and what I term 'voice,' the various identities present in the text.

I saw officers wrestle with these limitations, sometimes bending rules, sometimes choosing the path of least resistance. These choices reflect officer values and the pressures they face. Whether it’s a commitment to justice, a desire to help a colleague, or the need to quickly move on to the next call, the context impacts the data directly.

This work offers new empirical insight into the data underpinning public sector algorithms. By understanding the doings behind data, we can begin to question how we use them in algorithmic systems, which is particularly relevant in fields as impactful and powerful as policing.

Wednesday 28 August 2024

Guest Blog: Problem-solving? No, problem-opening! A Method to Reframe and Teach Data Ethics as a Transdisciplinary Endeavour

by Stefano Calzati and Hendrik Ploeger

Calzati, S., & Ploeger, H. (2024). Problem-solving? No, problem-opening! A method to reframe and teach data ethics as a transdisciplinary endeavour. Big Data & Society, 11(3). https://doi.org/10.1177/20539517241270687

Technology can go a long way in “chopping up” reality and reifying resources – and data are no exception to that. The thingness of data – just think of the refrain “data as the new oil” – is often considered as a given, i.e., a datum. Yet a growing body of research has shown that data are inherently sociotechnical, leading scholars to regard them as bundlings originating from processes at the coalescing point of technical and non-technical actors, factors, and values. So, the questions become: How to operationalize this? How, for instance, to teach new generations of undergraduates being trained in computer science, data analytics, software engineering, and similar technical subjects, that data are sociotechnical bundlings? How to incorporate such understanding into their practices?

In the article “Problem-solving? No, problem-opening! A Method to Reframe and Teach Data Ethics as a Transdisciplinary Endeavour” we set out to answer these questions. First, we reconceptualize data ethics not so much as a normative (dos vs don’ts) and axiomatic (good vs bad) toolbox, but as a critical compass for thinking about data as sociotechnical bundlings and orienting their fair processing. This, in turn, entails that data technologies are always good and bad at once, insofar as they produce, at all times, value-laden entanglements and un/intended consequences that demand to be unpacked and assessed in context, i.e., from different perspectives, simultaneously, and over time. This is an inherently transdisciplinary endeavor which cuts across epistemological boundaries, resists any privileged point of reference, and configures an ongoing multidimensional analysis.

What we describe in detail in the article is the application of this view to an elective course titled “Ethics for the data-driven city” which we purposely designed and taught as part of the Geomatics master's program at Delft University of Technology. Notably, we developed a transdisciplinary method that is not problem-solving, but problem-opening, that is, a method that helps students recognize and problematize the irreducibility of all ethical stances and the contingency of all technological “solutions”, especially when these are situated in the city as a complex system that resists computation. Overall, the course compels students, on the one hand, to think critically about (the definition of) problems, by shifting the ground on which engineering problem-solving rests, and, on the other hand, to materialize such a critical shift in their final assignments, conceived in the form of digital or physical artefacts.

Tuesday 18 June 2024

BD&S Journal will be on break from Aug 4th to Sept 4th, 2024

The editorial team of the journal Big Data & Society will be on break from August 4th to September 4th, 2024. Please expect delays in processing and reviewing your submission, and in related correspondence, during that time. Thank you!