Thursday, 7 November 2024

Paper Highlight: "Ethical scaling for content moderation: Extreme speech and the (in)significance of artificial intelligence"

Sahana Udupa discusses "Ethical scaling for content moderation: Extreme speech and the (in)significance of artificial intelligence", which she published with colleagues Antonis Maronikolakis and Axel Wisiorek in Big Data & Society. Read it on this link.

The paper received the ICA Outstanding Article Award for 2024. Congratulations!

Click here to learn more about the Centre for Digital Dignity.

Friday, 25 October 2024

Guest Blog: Automated Informed Consent

by Adam John Andreotta and Björn Lundgren

Andreotta, A. J., & Lundgren, B. (2024). Automated informed consent. Big Data & Society, 11(4). https://doi.org/10.1177/20539517241289439

If you are an average internet user, you probably encounter prompts for consent daily. For example, a pop-up window may have asked whether you would like to accept all cookies or only those necessary for the website you are visiting to function. While it is a good thing that companies actually seek your consent, repeated requests cause so much annoyance that many of us become complacent about the content of the agreement. No one really has the time to read and accept every privacy policy that applies to them online.

One solution to this problem that has emerged in the last few years is so-called “automated consent.” The idea is that software could learn what your privacy preferences are, and then do all the consenting for you. The idea is promising, but there are technical, legal, and ethical issues associated with it. In the paper, we deal with some of the ethical issues around automated consent. We start by articulating three different versions of automated consent: a reject-all option, a semi-automated option, and a fully automated option, which can feature AI and machine learning. Regardless of which version is used, we argue that a series of normative issues needs to be wrestled with before wide-scale adoption of the technology. These include concerns about whether automated consent would impede people’s ability to give informed consent; whether automated consent might negatively impact people’s online competencies; whether the personal data collection required for automated consent to function raises new privacy concerns; whether automated consent might undermine market models; and how responsibility and accountability might be impacted by automated consent. Of course, there is much to be said regarding the legal implications of the technology itself, and we invite legal scholars and computer scientists to explore the regulatory implications and technological options related to automated consent in our discussion.

Saturday, 5 October 2024

Guest Blog: Interoperable and Standardized Algorithmic Images: The Domestic War on Drugs and Mugshots Within Facial Recognition Technologies

by Aaron Tucker

Tucker, A. (2024). Interoperable and standardized algorithmic images: The domestic war on drugs and mugshots within facial recognition technologies. Big Data & Society, 11(3). https://doi.org/10.1177/20539517241274593

Generative AI (GenAI) systems, such as Midjourney, Stable Diffusion, and DALL-E, are data visualization systems. Such technologies are the result of their training data in combination with dense algorithmic mathematics: the images produced by such systems surface that original training data, pairings of text and image, for better and for worse.

This dynamic is especially problematic when image data that are laced with racialized vectors of power, such as mugshots, are freely available to those building GenAI models. From their inception in the late 19th century, in the work of scientists such as Francis Galton and Alphonse Bertillon, mugshots were always meant to be mobile and standardized: the accepted visuality of the mugshot, as a front-facing pose and a side profile taken in light designed to maximize visibility, ensured that the photograph could be “accurately” compared to any face in question across a variety of locations.

Such logics were adopted by computer scientists working on the problem of computational face recognition with mugshots. Reports such as the 1997 “Best Practice Recommendation for the Capture of Mugshots” stressed that mugshots needed to be “interoperable” so that they could shift between various facial recognition technologies (FRTs) and applications.

Mugshots are intercut with socio-technical systems such as policing practices, mental health support, and addiction support; mugshots are not neutral images, but rather a composite of affect, lived narrative, social power structures, and, often, violence in many forms. Therefore, as Katherine Biber warned in her 2013 article, “In Crime’s Archive,” there are real dangers when the criminal archive slips uncritically into the cultural sphere. It is crucial that we pay attention to GenAI and the ways that it tells on itself and the data it is visualizing through its creations. As my article describes, the ability to generate images with the prompt of “a mugshot” that are defined by the same biases as mugshot databases is alarming.

The solution is not to ban such prompts or crack down on prompt engineering that surfaces such results, but rather to address the root issue: the uncritical use and re-use of problematic data in machine training, not just in computer vision systems, but in all AI systems. 



Monday, 23 September 2024

BD&S 2024 Online Colloquium: POLITICS, POWER AND DATA

This year’s Big Data & Society colloquium centers on the theme of "Politics, Power, and Data," exploring the complex intersections where data, algorithms, and socio-political forces converge. Mark your calendars and be sure to join this exciting set of four talks. Details below.


Sunday, 22 September 2024

Guest Blog: By Isak Engdahl

Engdahl, I. (2024). Agreements ‘in the wild’: Standards and alignment in machine learning benchmark dataset construction. Big Data & Society, 11(2). https://doi.org/10.1177/20539517241242457

 

How do AI scientists create data? This ethnographic case study provides an empirical exploration into the making of a computer vision dataset intended by its creators to capture everyday human activities from first-person perspectives, arguably "in the wild". While this insider term suggests that the data is direct and untouched, this article reveals the extensive work that goes into shaping and curating it.

 

The dataset is shown to be the outcome of deliberate curation and alignment work, which requires careful coordination. Protocols such as data annotation guidelines, meant to standardize metadata inputs, help to ensure consistency in the technical work and so contribute to making the dataset usable for machine learning development.

 

A deeper intervention concerns the dataset creators’ use of a standardized list of daily activities, based on an American government survey, to guide what the dataset should cover and what the data subjects should record. The list did not map without friction onto the lives of data subjects situated in different contexts.

 

To address this, dataset creators engaged in alignment work—negotiating with data subjects to ensure their recordings fit the project’s needs. A series of negotiations brought the diverse contributions into a shared direction, structuring the activities and the subsequent data. The careful structuring of activities, the instructions given to participants, and the subsequent annotation of the data all contribute to shaping the dataset into something that can be processed by computers.

 

The study uses these results to promote a reconsideration of how we view the "wildness" of data—a quality associated with objectivity and detachment in science—and to recognize the layers of human effort that go into creating datasets. Can we consider data 'wild' if it’s so carefully curated? It is perhaps not as "wild" as it might seem—the alignment work required to make the dataset usable for machine learning does seem to blur the line between intervention and detachment. A fuller understanding of scientific and engineering practices thus emerges when we consider the often unseen, labor-intensive work that coordination and agreement within research teams rely on.

 

The dataset creators moreover developed specific benchmarks and organized competitions to measure the performance of different models on tasks like action recognition and 3D scene understanding based on the dataset. Benchmark datasets are crucial for evaluating and comparing the performance of machine learning models. Benchmarks, following the Common Task Framework, help standardize the evaluation process across the AI community: they enable distributed groups to collaborate. As actors convene to size up different models, benchmarks become a social vehicle that engenders shared meanings on model performance. However, they also reinforce the limitations inherent in the data, as they become the yardstick by which new models are evaluated. This underscores the importance of closely examining how benchmark datasets are constructed in practice, and the qualities attributed to them.

 

 

Saturday, 14 September 2024

Guest Blog: The 'doings' behind the data

by Isabelle Donatz-Fest

Donatz-Fest, I. (2024). The ‘doings’ behind data: An ethnography of police data construction. Big Data & Society, 11(3). https://doi.org/10.1177/20539517241270695

Data are the lifeblood of algorithmic systems. But data are often taken for granted by public organizations that see them as something just lying around, ready to use. Such is the case with police reports, which are increasingly used as data for algorithmic applications in policing worldwide.

But there are ‘doings’ behind data. Data are created in unexpected places, like the front seat of a speeding police car or the desk of an overworked detective. Material factors and human actors interact behind the scenes, informing data creation and interpretation.

I spent roughly 200 hours conducting ethnographic observation of how street-level employees at the Netherlands Police translate events into police reports. What I found was that data work is deeply embedded in policing, shaped by personal values, organizational context, and practical considerations.

Structured data often clash with the officers' understanding of a situation. Registration software demands that incidents be fitted into predefined categories, but the messy world that we live in rarely fits neatly into such boxes. Unstructured data provide more flexibility and richness but introduce complexities for standardization and (algorithmic) interpretation. Open text fields open the door to linguistic nuances, inconsistencies, and what I term 'voice,' the various identities present in the text.

I saw officers wrestle with these limitations, sometimes bending rules, sometimes choosing the path of least resistance. These choices reflect officer values and the pressures they face. Whether it’s a commitment to justice, a desire to help a colleague, or the need to quickly move on to the next call, the context impacts the data directly.

This work offers new empirical insight into the data underpinning public sector algorithms. By understanding the doings behind data, we can begin to question how we use them in algorithmic systems, which is particularly relevant in fields as impactful and powerful as policing.