Monday, 12 April 2021

Heritage-based tribalism in Big Data ecologies: Deploying origin myths for antagonistic othering

Chiara Bonacchi introduces her new article with Marta Krzyzanska, 'Heritage-based tribalism in Big Data ecologies: Deploying origin myths for antagonistic othering' in Big Data & Society doi:10.1177/20539517211003310. First published March 23rd 2021.

Video abstract

This article presents a conceptual and methodological framework to study heritage-based tribalism in Big Data ecologies by combining approaches from the humanities, social and computing sciences. We use such a framework to examine how ideas of human origin and ancestry are deployed on Twitter for purposes of antagonistic ‘othering’. Our goal is to equip researchers with theory and analytical tools for investigating divisive online uses of the past in today’s networked societies. In particular, we apply notions of heritage, othering and neo-tribalism, and both data-intensive and qualitative methods to the case of people’s engagements with the news of Cheddar Man’s DNA on Twitter. We show that heritage-based tribalism in Big Data ecologies is uniquely shaped as an assemblage by the coalescing of different forms of antagonistic othering. Those that co-occur most frequently are the ones that draw on ‘Views on Race’, ‘Trust in Experts’ and ‘Political Leaning’. The framings of the news that were most influential in triggering heritage-based tribalism were introduced by both right- and left-leaning newspaper outlets and by activist websites. We conclude that heritage-themed communications that rely on provocative narratives on social media tend to be labelled as political and not to be conducive to positive change in people’s attitudes towards issues such as racism.

Wednesday, 31 March 2021

Welcoming Our New Assistant Editors

I am very happy to introduce the six new assistant editors who have joined the editorial team at Big Data and Society. As assistant editors they will be working with the co-editors on special themes, helping authors set up blog posts, and overseeing the journal's Twitter account as well as a number of special projects related to the journal. Welcome to you all!

Andrew Dwyer: University of Durham, UK
Rohith Jyothish: Jawaharlal Nehru University (JNU), New Delhi, India
Gemma Newlands: University of Amsterdam and BI Norwegian Business School
Natalia Orrego Tapia: Pontifical Catholic University of Chile
Yu-Shan Tseng: University of Helsinki, Finland
Joshua Uyheng: Carnegie Mellon University, USA

More details on the backgrounds, interests and expertise of all our assistant editors can be found at

We are also delighted to have Margie Cheesman (Oxford Internet Institute) and Julie Saperstein (University of Kentucky) continue as AEs.

Finally, I also want to say goodbye to former Assistant Editor Mei-chun Lee (now Dr. Lee), who is stepping down from this role. Thanks for all your hard work; we wish you all the best for the future.

Thursday, 11 March 2021

“Reach the Right People”: The Politics of “Interests” in Facebook’s Classification System for Ad Targeting

by Kelley Cotter, Mel Medeiros, Chankyung Pak, Kjerstin Thorson

In recent years, targeted advertising has gained a prominent place in American politics. In particular, political campaigns, candidates, and advocacy organizations have turned to Facebook for its voluminous array of options for targeting users according to “interests” inferred by machine learning algorithms. In this study, we explored for whom this (algorithmic) classification system for ad targeting works in order to contribute to conversations about the ways such systems produce knowledge about the world rooted in power structures. To do this, we critically analyzed the classification system from a variety of vantage points, particularly focusing on the representation of people of color (POC), women, and LGBTQ+ people. First, we drew on donated user data, which included a list of political ad categories people had been sorted into on Facebook. We also examined Facebook's documentation, training materials, and patents for insight into the inner workings of the system. Finally, we entered into the system via Facebook’s tools for advertisers to explore its contents.
Through this investigation, we catalogued a series of cases that reveal the political order enacted via Facebook’s classification system for ad targeting. We particularly highlight four themes. First, we demonstrate how certain ad categories reflect what Joy Buolamwini calls a "coded gaze," or the “embedded views that are propagated by those who have the power to code systems” (2016: n.p.).  Second, we highlight how a disproportionate number of ad categories for women and people of color hint at an unmarked user and what Tressie McMillan Cottom (2020) calls "predatory inclusion." Third, we describe cases of ad categories that flatten dimensions of identity and suggest Kimberlé Crenshaw’s (1989) notion of a "single-axis framework" of identity, which fails to capture the intersectionality of identity. Fourth, we illustrate how Facebook's classification system exhibits something akin to what Whitney Phillips (2018) refers to as "both-sides-ism" by allowing for ad categories that could either represent an interest in civil rights or the endorsement of hateful ideologies.
Through these cases, we argue that Facebook's classification system for ad targeting is necessarily political as a result of its underlying technical and commercial logics and the human choices embedded in datafication processes. The system prioritizes the interests of the socially and economically powerful and represents those who have been historically marginalized not on their own terms, but on the terms of those occupying more privileged positions. We suggest that, as a tool for political communication, Facebook’s classification system may have downstream implications for the political voice and representation of marginalized communities to the extent that political campaigns, advocacy groups, and activists increasingly rely on it for cultivating and mobilizing supporters. As Facebook weighs the decision of if/when to reinstate political advertising, our study urges continued critical reflection on whose “interests” are served by Facebook’s classification system (and others like it). 


Buolamwini J (2016) InCoding — In the Beginning Was the Coded Gaze. Available at:  

Cottom TM (2020) Where platform capitalism and racial capitalism meet: The sociology of race and racism in the digital society. Sociology of Race and Ethnicity 6(4): 441–449. DOI: 10.1177/2332649220949473

Crenshaw K (1989) Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. University of Chicago Legal Forum Article 8: 139. 

Phillips W (2018) The oxygen of amplification: Better practices for reporting on extremists, antagonists, and manipulators online. Data & Society.

Tuesday, 9 March 2021

Welcoming Dr. Sachil Singh as a co-editor of Big Data and Society

We are very happy to announce that Dr. Sachil Singh (Queen's University, Canada) is joining the editorial staff of the journal as a co-editor. Dr. Singh will expand our editorial expertise in health and medicine, critical race studies and surveillance/sorting.

Dr. Sachil Singh
Surveillance Studies Centre, Department of Sociology, Queen’s University, Canada

Dr. Singh is a sociologist whose main areas of study are medical sociology, critical race studies and surveillance. The common thread in his work is attention to the racial outcomes of digital sorting technologies, which has allowed him to research topics as varied as credit scoring in South Africa and healthcare in Canada. His current work examines the extent to which healthcare practitioners in Canada rely on pithy conclusions about race and ethnicity from medical apps to inform patient diagnosis and treatment.


Monday, 22 February 2021

The Algorithm Audit: Scoring the algorithms that score us

by Shea Brown

Big Data & Society. doi:10.1177/2053951720983865. First published: January 28th 2021.

“The Algorithm Audit: Scoring the algorithms that score us” outlines a conceptual framework for auditing algorithms for potential ethical harms. In recent years, the ethical impact of AI has been increasingly scrutinized, which has led to a growing mistrust of AI and increased calls for mandated audits of algorithms. While there are many excellent proposals for ethical assessments of algorithms, such as Algorithmic Impact Assessments or the similar Automated Decision System Impact Assessments, these are too high level to be put directly into practice without further guidance. Other proposals have a narrower focus on technical notions of bias or transparency (Mitchell et al., 2019). Moreover, without a unifying conceptual framework for carrying out these evaluations, there’s a worry that the ad hoc nature of the methodology could lead to potential harms being missed.

We present an auditing framework that can serve as a more practical guide for comprehensive ethical assessments of algorithms. We clarify what we mean by an algorithm audit, explain key preliminary steps to any such audit (identifying the purpose of the audit, describing and circumscribing its context) and elaborate on the three main elements of the audit instrument itself: (i) a list of possible interests and rights of stakeholders affected by the algorithm, (ii) a list and assessment of metrics that describe key ethically salient features of the algorithm in the relevant context, and (iii) a relevancy matrix that connects the assessed metrics to the stakeholder interests. We provide a simple example to illustrate how the audit is supposed to work, and discuss the different forms the audit result could take (quantitative score, qualitative score, and a narrative assessment).
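To make the three elements more concrete, here is a minimal sketch in Python. It is not the scoring method from the article: the interests, the metric values, the relevancy weights and the weighted-average rule are all hypothetical and chosen only to show how a relevancy matrix can connect assessed metrics to stakeholder interests and yield a quantitative score.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    score: float  # assumed scale: 0 (no concern) to 1 (serious concern)


# (i) Stakeholder interests: hypothetical examples for a lending algorithm.
interests = [
    "loan applicant: fair treatment",
    "loan applicant: privacy",
    "lender: financial risk",
]

# (ii) Assessed metrics: the values are made up for illustration.
metrics = [
    Metric("bias", 0.8),
    Metric("transparency", 0.5),
    Metric("data security", 0.2),
]

# (iii) Relevancy matrix: one row per interest, one column per metric.
# Each entry (0 to 1) says how relevant a metric is to an interest.
relevancy = [
    [1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.5],
]


def interest_scores(interests, metrics, relevancy):
    """Relevancy-weighted average of metric scores for each interest.

    The weighted average is an assumption of this sketch, not a rule
    prescribed by the audit framework.
    """
    scores = {}
    for interest, row in zip(interests, relevancy):
        total_weight = sum(row)
        weighted_sum = sum(w * m.score for w, m in zip(row, metrics))
        scores[interest] = weighted_sum / total_weight if total_weight else 0.0
    return scores


if __name__ == "__main__":
    for interest, score in interest_scores(interests, metrics, relevancy).items():
        print(f"{interest}: {score:.2f}")
```

Keeping the metrics (descriptive) apart from the interests (normative) means an auditor can revise the relevancy weights for a particular sub-group of stakeholders without reassessing the metrics themselves.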

Our motivations for this separation of descriptive (metrics) and normative (interests) features are many, but one important reason is that this separation forces an auditor to carefully consider each stakeholder explicitly, and consider the possible relevance of various features of the algorithm (metrics) to that stakeholder’s interests. It’s important to note that different stakeholders in the same category (e.g. students, loan applicants, those up for parole, etc.) are often affected in very different ways by the same algorithm and often on the basis of race, ethnicity, gender, age, religion, or sexual orientation (Benjamin, 2019). We argue that understanding the context of an algorithm is a precursor to being able to not only enumerate stakeholder interests generally, but also to be able to identify particular sub-categories of stakeholders whose identification is relevant for ethical assessment of an algorithm (e.g. students of color, Hispanic loan applicants, male African-Americans up for parole, etc.). These stakeholders might face particular threats, and attention to context allows us to guard against thinking of groups of stakeholders as homogeneous entities that will be negatively or positively affected simply in virtue of the type of engagement with an algorithm, and to recognize socio-political and socio-technical factors, and power dynamics at play (Benjamin, 2019; D’Ignazio and Klein, 2020; Mohamed et al., 2020).

The proposed audit instrument yields an ethical evaluation of an algorithm that could be used by regulators and others interested in doing due diligence, while paying careful attention to the complex societal context within which the algorithm is deployed. It can also help institutions mitigate the reputational, financial, and ethical risk that a poorly performing algorithm might present.  


Sunday, 13 December 2020

How the pandemic made visible a new form of power

By Engin Isin and Evelyn Ruppert

The political pandemic known by its biomedical name COVID-19 has thrown us off balance. We have steadied ourselves (somewhat) by taking a historical view on forms of power and re-evaluating some long-held assumptions about power in social and political theory to see what the present might reveal. The result is this article where we broadly provide an account of three forms of power (sovereign, disciplinary and regulatory) and suggest that the coronavirus pandemic has brought these three forms of power into sharp relief while making particularly visible a fourth form of power that we name ‘sensory power’. We place sensory power within an historical series straddling the 17th and 18th (sovereign), 18th and 19th (disciplinary), and 19th and 20th (regulatory) centuries. The birth of sensory power straddles the 20th and 21st centuries and, just like earlier forms of power, it neither replaces nor displaces them but nestles and intertwines with them. So, rather than announcing the birth of a new form of power that marks this age or epoch we articulate its formation as part of a much more complex configuration of power. We conclude the article with a discussion of the kinds of resistance that sensory power elicits.

This article follows the second edition of our book, Being Digital Citizens, which was published a few months earlier. Its new chapter examines various forms of digital activism. We were inspired by the range, spread, and diffusion of such acts of resistance and yet were unclear to what form of power they were a response. Examining these acts of resistance compelled us to take a historical view and to name a fourth form of power. With the onset of the pandemic, that form of power became visible and through the article we came to name it as sensory power.


Both the article and the book are available at the links below and are open access.


Isin, Engin, and Evelyn Ruppert. 2020. ‘The Birth of Sensory Power: How a Pandemic Made It Visible’. Big Data & Society 7(2). Open access.


Isin, Engin F., and Evelyn Ruppert. 2020. Being Digital Citizens. Second edition. London: Rowman & Littlefield.  Go to tab 'features' to download open access pdf.

Thursday, 10 December 2020

Automated Facts, Data Contextualization and Knowledge Colonialism: A Conversation Between Denny Vrandečić and Heather Ford on Wikipedia’s 20th Anniversary

At the start of 2021, contributors across the world will celebrate the 20-year anniversary of the first Wikipedia, and the 250,000 volunteers who support it in over 300 languages. In advance of Wikipedia’s 20th anniversary, Joseph Reagle, co-editor of Wikipedia @ 20: Stories of an Incomplete Revolution, 2020 MIT Press, facilitated a conversation between two of the volume’s essayists on the potential power and pitfalls of Wikipedia’s latest project: Abstract Wikipedia, an effort to automate the sharing of “facts”, information and data across different language versions of the online encyclopedia.

Denny Vrandečić (founder of Wikidata and now Abstract Wikipedia, see FN1) and Heather Ford (scholar of Wikipedia governance) discuss the automation of digitised facts (e.g., Marie Curie’s birth year is 1867) as a means of reducing user effort and possible mistakes. Vrandečić and Ford discuss two related concerns: what happens when abstract “facts” are divorced from their context via automation, and how the automated import of decontextualized data from larger projects can overwhelm smaller ones. Both of these concerns relate to the question of how automation influences people’s agency over the information that they produce at a local level. Automation may enhance efficiencies but this big data practice has important consequences for power and accountability within the world’s most widely used encyclopedia.

------ Context ------

Ford: One of the key dangers with the automated extraction of facts from Wikipedia is that it has resulted in the loss of citation data. On Wikipedia, there is a significant focus on citing statements to reliable sources and this work is seen as central to the principle of verifiability. But on Wikidata, many statements are only cited to a particular Wikipedia language version because of their mass import. This ultimately prevents readers being able to follow up the source of a statement and influences the accountability of Wikimedia. How will Abstract Wikipedia avoid this problem?

Vrandečić: The problem of unsourced statements is not new to Wikidata. Wikidata makes it much more visible, though. As of 2020, more than 65% of statements on Wikidata are sourced – this has gone up from less than 20% five years ago. Wikidata is improving considerably. I am actually convinced that the ratio of referenced statements on Wikidata is by now considerably better than on Wikipedia. Unfortunately, we are often compared with a perfect 100%, which I don’t think is fair.

On Wikipedia it is much harder to count how many of its statements are actually sourced. But if you take a look at an average article on Wikipedia, you will find that far less than half of the statements in it have references – but we have no automatic metric counting this, so the actual comparison is rather hidden. I would love to see more research in this area.

Abstract Wikipedia will be more similar to Wikidata: it will be far easier to count how many statements have references than on Wikipedia, which will be the same blessing and curse as for Wikidata.

The other advantage is that due to the way Abstract Wikipedia is built, we will be able to generate sentences for each statement, and not just structured data as in Wikidata. This may in fact allow us to look for references in digital source material more easily. Together with the fact that we can more easily see which statements are missing references, it should be possible to more easily create workflows that will search for possible references explicitly, and allow the community to fill the gaps.

Ford: You are right that the problem of unsourced statements is not new to Wikidata and it is true that it is much easier to figure out how much of Wikidata is unsourced than Wikipedia. But I don’t think that this is a reason to disavow the problem. The problem is significant: it influences Wikipedia’s core principles and affects how Wikipedia’s facts are ultimately understood – not as provisional statements that emerge from particular perspectives, but as hard facts that cannot be questioned. One could also argue that the problem is even more important to solve on Wikidata because of the authority that datafied facts have when they are extracted and presented on third party platforms.

I am also not as hopeful as you are that the problem is being solved over time. Having discussed this with others, I believe that the reason we are seeing improvement in the statistics is that WikiCite’s focus is on creating and maintaining source and citation data rather than annotating current Wikidata items. If you look at Wikidata items about places, Johannesburg, for example, you’ll notice that the majority of statements are not sourced meaningfully beyond a particular Wikipedia language version. A vast amount of manual labour is needed to do this work, and I just don’t see it happening anytime soon without a clear incentive and programme to do so at scale. I agree with you that we need more research in this area to look at the extent to which particular topics (such as places) are well-cited and how this has changed over time, but first it needs to be acknowledged that there is a potential problem that isn’t being dealt with.

I hope that Abstract Wikipedia prioritises and necessitates the addition of sources as new statements are added. If this can happen at the outset, with the small scale approach at the start, and if Abstract Wikipedia prioritises human review of sources, then it will certainly be an improvement on Wikidata. 

Vrandečić: I also hope that the Abstract Wikipedia community will put the right emphasis on sourcing its content, but in the end this is a decision to be made by that community, and not by me.

I think what helped with improving the situation on Wikidata so much over the last few years was that once the concern was raised, dashboards and metrics were introduced which made the situation more visible. I think the fact that the coverage of statements with sources more than tripled since then is a testament to the fact that this problem has been recognized and has been worked on by the community, and that the community is in fact very capable of dealing with such issues.

Because of that I plan to put the same trust in the future community that will work on Abstract Wikipedia. And I welcome researchers like you to critically observe and accompany that development. After all, it was exactly these concerns that led to the setup of such dashboards and to making these numbers more observable.
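The sourcing metric debated in this exchange can be sketched in a few lines of Python. The sketch below is not one of the dashboards mentioned above. It assumes an item shaped like Wikidata's JSON export (a "claims" mapping whose statements may carry a "references" list of "snaks"), and it treats a statement as cited only to Wikipedia when its references use nothing but property P143 ("imported from Wikimedia project"). The sample item is hand-made for illustration and is not real data about any place.

```python
from collections import Counter

# Wikidata property "imported from Wikimedia project".
IMPORTED_FROM_WIKIMEDIA = "P143"


def classify_statement(statement):
    """Return 'unsourced', 'wikipedia_only' or 'externally_sourced'."""
    references = statement.get("references", [])
    if not references:
        return "unsourced"
    for reference in references:
        properties = set(reference.get("snaks", {}))
        # Any property other than P143 counts as an external source here.
        if properties - {IMPORTED_FROM_WIKIMEDIA}:
            return "externally_sourced"
    return "wikipedia_only"


def sourcing_summary(entity):
    """Count statements per sourcing category for one item."""
    counts = Counter()
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            counts[classify_statement(statement)] += 1
    return counts


# Hand-made sample item (hypothetical, values left empty on purpose).
sample_item = {
    "claims": {
        # Referenced only as an import from a Wikipedia language version.
        "P17": [{"references": [{"snaks": {"P143": []}}]}],
        # Referenced with "stated in" (P248) and "retrieved" (P813).
        "P1082": [{"references": [{"snaks": {"P248": [], "P813": []}}]}],
        # No references at all.
        "P31": [{}],
    }
}

if __name__ == "__main__":
    summary = sourcing_summary(sample_item)
    total = sum(summary.values())
    for category, count in sorted(summary.items()):
        print(f"{category}: {count} of {total}")
```

Reporting "wikipedia_only" separately from "externally_sourced" keeps both positions visible: the headline share of referenced statements, and the share that a reader can actually follow up beyond a Wikipedia language version.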

------ Colonialism ------

Ford: Abstract Wikipedia is intended to help populate smaller Wikipedias. How will you prevent the English-speaking community and other dominant languages from drowning out local content? For example, the encyclopedia in the Cebuano language, spoken in the southern Philippines, is one of the largest because it has been flooded by thousands of automatically translated articles, which is difficult for the local Cebuano community to curate and maintain.

Vrandečić: As far as I can tell, the problem with the Cebuano Wikipedia is not the availability of content to readers of Cebuano – but that the Cebuano community has no control over this content, how much of it the community wants to integrate, how to filter the machine-generated content out, and how to change it – in short, how to take ownership of the content. These articles are not really owned and maintained by the Cebuano community, but by a single person, who, in dialog with the community, runs his bot to write these millions of articles.

But there is no proper path for the wider community to take ownership of this large set of articles. This is one thing Abstract Wikipedia aims to change: the content and the integration in the local communities will be entirely controlled by the local community. Not only whether they want to pull in Abstract Wikipedia wholesale or not, but in much more detail. And the community members will always be able to go to Abstract Wikipedia and work on the content itself. They are not just passive recipients of content, but they can make their voices heard and join in the creation and maintenance of the content across languages.

I agree with the sentiment of your question, and I do have a similar worry: it is imperative that Abstract Wikipedia does not become another tool for colonizing the knowledge sphere even more thoroughly. Even with the Cebuano community members contributing to Abstract Wikipedia, they will likely be outnumbered by community members from Europe and the US. What are your ideas that, if implemented, could help with Abstract Wikipedia becoming a true tool towards knowledge equity?

Ford: You’re right that the problem with Cebuano Wikipedia is not the availability of content but the control over that content. What is interesting is that it was a Cebuano Wikipedian who built the first bot that “dump(ed) articles on all the communes of France (over 50,000) on Cebuano Wikipedia,” according to long-time Filipino Wikipedian Josh Lim.

Lim wrote that the result wasn’t only that “local content was drowned out,” but that it sparked “a numbers war among local Wikipedias” when “other Philippine-language Wikipedias tried to outdo one another in achieving high article counts.” The move to automate articles brought with it a set of values that prioritised quantity over quality, and prioritised subjects that could easily be automatically added (communes of France, for example) over locally relevant topics. It affected the ability of the small Cebuano community to grow its content organically and caused embarrassment when Filipino Wikipedians met other Wikipedians abroad as Cebuano Wikipedia was nicknamed “the Wikipedia of French communes.”

The answer, I think, is for us to think more carefully about what we mean when we say that the “community” will be able to “control” their Abstract Wikipedia project. What defines the Cebuano Abstract Wikipedia community? And how will they make decisions about the content in their repository? The problem with Cebuano Wikipedia was that concerned community members only realised the scale of the problem after the first bot had done its work and the editor running the bot had left the project. There was no consultation. And after the precedent had been set, editors consented to further bot work because they were in a race to increase the visibility of their Wikipedia project. The problem wasn’t only about the domination of Western knowledge. It was about the domination of a logic that recognises the number of articles as a heuristic for quality and/or as a way that small projects gain visibility on the project. 

Is there a way that we can, at the outset of a project, think about progress beyond quantitative heuristics towards qualitative ones? This could mean focusing not on the numbers of articles created but on the development of local principles and policies, for example. 

Can we start small and focus on inclusive deliberation with language communities beyond Wikimedia? This could mean working with only a few languages during an evaluation period in which there is a focus on changing the scope and venue for deliberation – beyond discussions with the few who have the interest, knowledge, and time to give their opinion on wiki, to equipping representatives from the language community with the knowledge necessary for them to make an informed, collective decision beyond the wiki. 

Can we provide communities with the real ability to reject the project and at different stages, using the same principles of deliberation? This won’t be as easy as how most projects are launched and conducted, but it will reap returns because it will foreground planning and discussion and lead to a decision that can be ethically justified. 

Finally, can we invite social scientists working with interested Wikimedians to study the project as it evolves, suggesting important research questions that you highlight here so that we can evaluate the extent to which Abstract Wikipedia fulfills the goals it seeks over time? A number of very promising principles and evaluation frameworks are emerging to investigate the ethical impact of automated systems, and it makes sense that Abstract Wikipedia engage with them.

Vrandečić: I would love for us to move from a metric of number of articles to a more meaningful metric, such as the number of contributors. I think this would be a much more meaningful metric reflecting the health of our projects.

I am wary of setting up deliberative bodies that stretch beyond Wikimedia to make decisions about the Wikimedia projects. One of our strengths is the autonomy of the projects. Setting up a body that is above the actual productive community working on the wiki and possibly dictating which kind of content should and should not be written sounds incompatible with the way our projects have worked so far. Inviting new people into the Wikimedia projects, yes, absolutely, but I think that in the end what happens in a Wikipedia language edition should be decided by the active community members of that given Wikipedia.

What I can commit to is to set up a forum to explicitly discuss the ethics and the implications of the Abstract Wikipedia project, and to send out a wider invitation beyond the current Wikimedia communities. This should allow us to identify potential problems and to either design solutions or avoid parts of the project altogether. One result would indeed be to make sure that the local communities retain control of the whole process of integrating content from Abstract Wikipedia, not just nominally but also practically and deliberately. I would be honored if you’d join us in this forum.

Ford: I get what you’re saying about the problems of non-Wikimedia editors making decisions that Wikimedians have to implement. But I believe that it is a mistake to limit deliberation about the future of Wikimedia projects to the wiki. It’s a mistake because what is presented as fact on Wikimedia isn’t just another representation acting as an equal player in a sea of alternative representations. Wikipedia and the Wikimedia projects that serve it are a dominant representation that affects people well beyond the wiki and around the world because it is increasingly recognised as a global consensus view about people, places, events and things. That means that there are stakeholders well beyond those represented by a small Wikipedia community who should be a part of those discussions because it is those communities who will be affected by those representations. Expanding those discussions doesn’t necessarily mean that the result will be decisions that Wikimedians can’t implement. Recent innovations in deliberation design have demonstrated how this is, indeed, possible to achieve. I am really pleased that you will set up a forum to discuss the ethics and implications of the Abstract Wikipedia project and I look forward to finding ways to expand the deliberative scope of that conversation. Thank you for your invitation!


FN1. Abstract Wikipedia is a provisional name for this new project and the selection of a final name is still underway.

Denny Vrandečić is a computer scientist who founded Wikidata in 2012 so that information could be shared across projects and languages. This year, Vrandečić followed up with Abstract Wikipedia for more complex expressions. Consider the statement: “Marie Curie is the only person who received two Nobel Prizes in two different sciences.” In his contribution to the book, he describes how such a claim might be represented in an abstract and computer-understandable way and then automatically translated across languages.

Heather Ford is a scholar of Wikipedia and Internet governance and author of the chapter Rise of the Underdog. In her forthcoming book about the Wikipedia “fact factory,” Ford traces the development of the 2011 Egyptian Revolution article to show how factual claims are conceived, contested, and constructed at Wikipedia before being used by the likes of Google and Bing. It is a story about how knowledge automation offers great efficiencies while challenging some of the ideals of the Internet as global public infrastructure. 

For further reading on linked data and automation in the context of Wikipedia and Google, see Heather Ford’s 2016 work in collaboration with Mark Graham from the Oxford Internet Institute in a chapter for the book Code and the City edited by Kitchin and Perng called “Semantic Cities: Coded Geopolitics and the Rise of the Semantic Web” (pre-print), and a journal article for Environment and Planning D “Provenance, Power and Place: Linked Data and Opaque Digital Geographies” (pre-print).