Wednesday, 27 April 2022

Hey Siri, can you hear me now? A framework for building natural language processing tools that advance linguistic justice

Nee J, Smith GM, Sheares A, Rustagi I. Linguistic justice as a framework for designing, developing, and managing natural language processing tools. Big Data & Society. January 2022. doi:10.1177/20539517221090930

The increasing ubiquity of natural language processing (NLP) tools that learn from and use human language is undeniable. NLP-powered tools now produce records of court hearings, inform job interview analyses, respond to our verbal requests through smartphones, and more. However, NLP tools don’t serve all people equally: they often perform better for certain speakers and advance linguistic profiling. So, a critical question remains: how can these technologies equitably serve all members of society, regardless of their language background?

The concept of linguistic justice can be used to frame NLP tool development in a way that centres the needs of all users, rather than prioritising speakers of privileged languages like “Standard” English. Linguistic justice is achieved when all individuals are granted equitable access to social, political, and economic life, regardless of their linguistic repertoire. Linguistic justice, then, requires that NLP tools serve diverse speakers and signers equitably.

Our commentary examines in detail two issues with current NLP tool development. First, if NLP tools learn from datasets that lack sufficient data from speakers of minoritised language varieties, those tools may underperform for those users. Second, NLP systems can use language to infer information about the identities of users, a process known as linguistic profiling. Even when protected information (e.g., race, gender) is not directly provided to an NLP system, the system may still infer a user’s identity from features of their language use. Inferred characteristics may then be used to mediate access to goods, services, and opportunities, resulting in unlawful discrimination.

We present nine specific actions that researchers, developers, and business leaders can take to design, develop, and manage NLP systems that advance linguistic justice. These include, for example, working with diverse language communities in participatory and empowering ways, ensuring language data is labelled by people familiar with the particular language variety, and examining and altering power structures so that the needs and perspectives of those at the margins are prioritised.

Instead of being comfortable with the status quo, this work requires imagining and working towards a world where users of all language varieties are able to equitably access social, economic, and political life. It requires rethinking how we collect data and what data we value in NLP development. Our nine actions provide a path toward that world, one in which NLP systems can advance linguistic justice and, thereby, social justice.

Tuesday, 19 April 2022

In Search of the Citizen

Heather Broomfield & Lisa Reutter

Broomfield H, Reutter L. In search of the citizen in the datafication of public administration. Big Data & Society. January 2022. doi:10.1177/20539517221089302

How are citizen perspectives problematised and included in policy and practitioner discourse in the datafication of public administration?

During an initial analysis of empirical material collected to map the Norwegian public sector, we were struck by the discursive absence of citizens in the realisation of this all-encompassing administrative reform. This sparked our curiosity, leading us to raise our gaze from the organisational to the system level and investigate who was invited to participate in policy formation and how citizens were described, in both the resulting documents and practitioner discourse.

The Norwegian data-driven context is particularly interesting to investigate. The state has collected vast amounts of data on the population for decades, the recirculation of which can make data-driven public administration realisable on a scale unimaginable in many other countries. Norway is also characterised by a corporative pluralism in which collaboration with external actors and interdependent decision making with interest organisations and business representative organisations are deemed fundamental to policy making (Rokkan, 1966).

Unexpectedly, we identified a paternalistic and top-down technocratic approach to citizen engagement with non-participation particularly apparent at the policy making level. Citizens and civil society are reduced to a passive but demanding ‘user’ to be served by the public sector. This is in direct contrast to active engagement with the private sector during all phases - from policy production through to implementation. 

Datafication often escapes democratic decision-making as the context, values, and agendas of this administrative reform are obscured from citizens and civil society. We hope this paper sparks interest among practitioners and scholars alike. 

Sunday, 13 March 2022

Special Theme on Data, Power, and Racial Formation

Data, Power, and Racial Formation: Situating Data within Interlocking Systems of Oppression

The promise of big data lies in its ability to draw connections and reveal patterns about social life. There are growing concerns, however, that reliance on big data threatens not only to automate discrimination and oppression but also to become a central mechanism through which racism operates.

Critical observers encourage closer attention be paid to how power manifests in and through the application of big data as well as through automated systems and so-called ‘smart’ technologies. This special issue heeds their advice by exploring how race and racism are entangled in the collection and use of data. The papers in this collection showcase a range of interdisciplinary insights to demonstrate how data studies might benefit from deeper engagement with intellectual schools of thought concerned with race and racism—both theoretically and practically. They illustrate how theories of race and racism can enhance understandings of big data’s material impacts and can inform approaches to addressing these impacts. Authors make productive inroads regarding how data emerge in and through racial projects as they intersect with systems of class, colonialism, disability, gender, and sexuality. 

Looking at how big data reflects entanglements of racialised power prompts a range of critical questions. How do modes of datafication normalise racial classification systems and mask their sociocultural underpinnings? To what extent can big data work in the service of liberatory agendas? What are the opportunities and risks of practices and systems that promise more equitable outcomes? 

In answering these questions, this collection captures connections and tensions between data and racial formation. Racial formation, often associated with work by sociologists Michael Omi and Howard Winant, captures the relationships between socio-economic and political changes and the shifting nature and value of racial categories, including the meanings they come to reflect. 

The collection thus builds upon longstanding concerns about data and racialised power. Its aim is to bring them into dialogue with more recent discussions in critical data studies. Such work includes, but is not limited to, how data animate knowledge systems in ways that may actively harm Black and Indigenous peoples, as well as other minoritised communities, and how race and racialised value systems appear in datasets, machine learning models, and ‘smart’ home devices. In addition to contributing to scholarly debates about how data are mobilised to generate racial formations, authors’ insights support strategies for anti-racist movements by drawing attention to how these formations can be challenged and disrupted.

Papers in this special issue address how data become implicated within the interlocking systems of domination and oppression, doing so in ways that are attentive to effects that emerge in everyday lives and livelihoods. Phan and Wark reconsider how datafied processes evince new shifts in processes of racialisation. Hatch examines how the governance of COVID-19-related health data became a site where data about racial death became framed as a matter for liberal reform but served to support racist social systems. Henne, Shelby, and Harb demonstrate how racial capitalism can advance understandings of data capital and the inequalities it can generate, doing so through an in-depth study of digital platforms used for intervening in gender-based violence. Sooriyakumaran is similarly concerned about racialised inequalities etched and shaped by capitalist relations, considering how they manifest in digitised residential tenancy databases in Australia. Crooks (2021) shows how non-profit efforts to make public schools more “data driven” can be understood as racial projects. Anantharajah (2021) attends to how racial formation takes shape through colonial data projects, drawing on ethnographic research on climate finance governance conducted in Fiji. 

This special thematic issue offers one set of responses to the pressing need to critically examine how race and racism are entangled in the collection and use of data. It brings together longstanding and emergent concerns regarding data’s role within racial formation. It also reflects on recent cultural and political developments as well as geopolitical and socio-technical shifts. In doing so, this collection of papers marks an attempt to illuminate how data become implicated within the interlocking systems of domination and oppression that affect everyday lives and livelihoods. We recognise that many others are working on similar projects. We therefore hope the collection serves as a productive resource for readers from a range of fields and contributes to a generative dialogue that crosses disciplinary boundaries.

Friday, 21 January 2022

Alternative data and sentiment analysis: prospecting non-standard data in machine learning-driven finance

Kristian Bondo Hansen (@BondoHansen) and Christian Borch (@CboMpp)

The first three months of 2021 saw two remarkable events shake the global economy and the world of finance. One was the meme-stock short squeeze on aggressive short-selling hedge funds, driven by retail investors who mobilised on social media platforms (most notably Reddit’s sub-site WallStreetBets) and used the fingertip market access provided by trading apps like Robinhood to rally around and effectively run up the share prices of ailing American companies such as the video game store franchise GameStop and the cinema chain AMC Theatres. This event turned conventional views on the drivers of prices in financial markets topsy-turvy and directed investors’ attention to market chatter in social media communities. The other event was the Ever Given’s clogging of the Suez Canal, a freak accident that blocked for a week the narrow Egyptian waterway through which 30% of all global container traffic travels.

What unites these two events? From an investor perspective, both showed the predictive potential of data sources outside standard market data (price, trade, volume, etc.). Closely monitoring freight routes via GPS tracking and analysing inventories using satellite imagery of container terminals can help detect frictions along the supply chain that may eventually have a negative impact on individual firms’ financial statements. Similarly, keeping an ear to the buzz on social media might prove an inside track to predicting retail rallies in individual stocks.

In our article ‘Alternative data and sentiment analysis: prospecting non-standard data in machine learning-driven finance’, we examine this disparate category of heterogeneous data sources—including social media data, GPS tracking data, sensor data, satellite imagery, credit card transaction data, and more—which market professionals call ‘alternative data’. We focus mainly on social media data, scraped from the web, and parsed through Natural Language Processing (NLP) machine learning algorithms. Drawing on interviews with investment managers and traders using alternative data, as well as intermediaries sourcing and vending such data, we show how alternative data are viewed, used, monetised, and exploited by market professionals.

Our key argument is that alternative data should always be considered in relation to the analytics tools with which patterns are extracted, signals discovered, or anomalies detected in big data sets. A crucial task in alternative data use is to render data amenable to analysis with the analytics tools available, a type of standardisation effort that data scientists call ‘prospecting’. As the financial industry’s interest in alternative data grows, ever more data sources undergo prospecting and are thus rendered finance-relevant. We propose to see this development as financialisation on the data level (the continuous appropriation of data that could potentially be transformed into valuable market insights) and argue that it prompts new concerns about how to govern this ever-expanding category of heterogeneous data both inside individual firms and across the industry.

Wednesday, 12 January 2022

The ‘Media-Use-No-Trust’ Paradox

by David Mathieu and Jannie Møller-Hartley

In the autumn of 2018, European email inboxes overflowed with ‘your data is safe’ messages from all sorts of obscure companies and businesses that we had once engaged with. This was a direct consequence of the GDPR implemented in the European Union. As researchers, we were puzzled by how citizens might respond to such messages, and by what it meant that citizens in their daily lives were entangled with platforms and apps that collect, analyse, and make predictions based on their data, both as we move around the city using transportation and as we engage with teachers in schools or with our doctors via email consultation. Denmark, the context of this study, forms an interesting case because of the country’s extremely high internet penetration, high degree of digitalisation, and high trust in public authorities.

We asked ourselves: how is trust in datafied media negotiated in our datafied lives? What do the use of media, our dependency on them, and our entanglement with them mean for such negotiations? Our society relies heavily on media, and hence citizens’ perceptions of data must be understood through the prism of their relationship with media. Additionally, we wondered how the implementation of GDPR, a framework meant to protect citizens’ data, might affect the negotiation of our datafied everyday lives.

We decided that the questions were so complex that we needed to engage qualitatively with all the nuances in citizens’ daily negotiations, negotiations they might not even be aware they are having, or that they might not recall in an interview situation. We opted for focus groups, having different groups of citizens discuss these issues and hoping that a few discussion prompts would bring some of these inner negotiations to the surface. During a cold and dark winter in Roskilde, we brought people from different occupations and of different ages together to discuss with us. We asked them to map all the platforms and apps they encounter and use on a daily basis, and to discuss how far they trust each of them and what rationales they use for doing so. Our method helped participants make explicit what we in the analysis call ‘heuristics of trust’.

We saw five sets of heuristics guiding the trust assessments of citizens: 1) characteristics of media organisations, 2) old media standards, 3) context of use and purpose, 4) experiences of datafication and 5) understandings of datafication. Interestingly, the participants used their previous experiences to guide their understanding of how they could respond to the chilling effects or anxieties they were experiencing with having their data collected and their media use monitored. They considered what risk is associated with which type of data. In practice, this meant that they did not mind giving up data about their shoe size, film preferences or what they had recently bought on Amazon, but were naturally more worried about data on their health and their financial situation.

All in all, citizens are guided by partial, embedded ‘structures of perception’ and are enticed into trusting datafied media in the context of their everyday lives. They may be highly concerned by the datafication of the media they use, but they use them heavily nevertheless. In fact, they recognised that some of the media they use the most are also the ones they trust the least with their data. Rationally, we would expect a lack of trust to lead to an adjustment in their use of media, but our data show that this is not the case at all. Their assessment of trust is an ongoing and practical negotiation that is weighed against the benefits they get from media and the entanglement of their everyday life in media. We also found that trust is not necessarily improved by users having to read pages and pages of consent forms, cookie declarations, etc. Guided by their immediate need of use, adapting their previous understandings of datafication on an ongoing basis as their knowledge increases, and positioning themselves towards an urge to consider data privacy seriously, citizens assess trust through a complex process affected by many things other than ‘just’ regulation or rational behaviour. This is a paradox that regulators, businesses, and scholars should recognise in the next generation of GDPR regulative measures.


Monday, 20 December 2021

The datafication revolution in criminal justice: An empirical exploration of frames portraying data-driven technologies for crime prevention and control

by Anita Lavorgna (@anitalavorgna) and Pamela Ugwudike (@PamelaUgwudike)


The article we recently published presents and discusses the findings of a review of multidisciplinary academic abstracts on the data-driven algorithms shaping (or with the clear potential to shape) decision making across several criminal justice systems. Over the years, increased attention has been paid to the possibilities for big data analytics to inform law enforcement investigations, sentence severity, parole decisions, criminal justice resource allocation, and other key decisions. In some jurisdictions, data-driven algorithms are already deployed in these high-stakes and sensitive contexts, with major real-life implications. While their use can potentially enhance the efficiency of certain routines and time-consuming tasks, they can also give rise to adverse outcomes (notably racial and other types of systemic bias, along with other ethical concerns). Given the complexity of human behaviour, and other problems such as the reliance of some algorithms on flawed administrative data, algorithmic outputs (e.g., risk predictions) can be spurious. But key stakeholders such as the justice systems that deploy them and the general public might overestimate the reliability of these results, as they might not fully understand the complex mechanisms behind them.


While working on our research on cybercrimes and cyber harms (Lavorgna) and on conduits of bias in predictive algorithms deployed in justice systems (Ugwudike), both inherently interdisciplinary research areas that led us to read across a wide range of publications from several disciplines, we noticed how differently data-driven tools were portrayed depending on the disciplinary take. Similarly, reading declarations by policy makers, the security industry, and the creators, vendors, and other proponents of the algorithms, we could not avoid noticing a certain hype around the effectiveness of these tools, in contrast with our own research experience. Intrigued by this puzzle, we decided to carry out some research and move beyond our anecdotal experience.


In our contribution, we now propose a typology of frames for understanding how relevant technologies are portrayed, and we elucidate how notions of sociotechnical imaginaries and access to digital capital are of the utmost importance in explaining differences in how the value and impact of the technologies are framed. We hope that our work can help further critical debates not only on algorithmic harms, but also on the importance of truly interdisciplinary research to facilitate the inclusion of a broader range of perspectives in our current, and unavoidable, datafication challenge.

Tuesday, 14 December 2021

BD&S Dec/Jan break

The editorial team of the journal Big Data & Society will have a break from December 19th to January 8th.
Please excuse any delays in processing and reviewing your submission, and in related correspondence, during that time. Thank you!

Happy Holidays!