Friday, 21 January 2022

Alternative data and sentiment analysis: prospecting non-standard data in machine learning-driven finance

Kristian Bondo Hansen (@BondoHansen) and Christian Borch (@CboMpp)

The first three months of 2021 saw two remarkable events shake the global economy and the world of finance. One was the meme-stock short squeeze, in which retail investors squeezed aggressive short-selling hedge funds. These investors mobilised on social media platforms, most notably Reddit’s sub-site WallStreetBets, and used the fingertip market access provided by trading apps like Robinhood to rally around and effectively run up the share prices of ailing American companies such as the video game store franchise GameStop and cinema chain AMC Theatres. This event turned conventional views on the drivers of prices in the financial markets topsy-turvy and directed the attention of investors to market chatter in social media communities. The other event was the Ever Given’s clogging of the Suez Canal: a freak accident resulting in a week-long blockage of the narrow Egyptian waterway through which 30% of all global container traffic travels.
What unites these two events? From an investor perspective, both events showed the predictive potential in data sources from outside of standard market data (price, trade, volume, etc.). Closely monitoring freight routes via GPS tracking and inventory analysis using satellite imagery of container terminals can help detect frictions along the supply chain that may eventually have a negative impact on individual firms’ financial statements. Similarly, keeping an ear to the buzz on social media might prove an inside track to predictions of retail rallies in individual stocks.

In our article ‘Alternative data and sentiment analysis: prospecting non-standard data in machine learning-driven finance’, we examine this disparate category of heterogeneous data sources—including social media data, GPS tracking data, sensor data, satellite imagery, credit card transaction data, and more—which market professionals call ‘alternative data’. We focus mainly on social media data, scraped from the web, and parsed through Natural Language Processing (NLP) machine learning algorithms. Drawing on interviews with investment managers and traders using alternative data, as well as intermediaries sourcing and vending such data, we show how alternative data are viewed, used, monetised, and exploited by market professionals.

Our key argument is that alternative data should always be considered in relation to the analytics tools with which patterns are extracted, signals discovered, or anomalies detected in big data sets. A crucial task in alternative data use is to render data amenable to analysis with the analytics tools available, a type of standardisation effort that data scientists call ‘prospecting’. As the financial industry’s interest in alternative data grows, ever more data sources undergo prospecting and are thus rendered finance-relevant. We propose to see this development as financialisation on the data level (the continuous appropriation of data that could potentially be transformed into valuable market insights) and argue that it prompts new concerns about how to govern this ever-expanding category of heterogeneous data both inside individual firms and across the industry.

Wednesday, 12 January 2022

The ‘Media-Use-No-Trust’ Paradox

by David Mathieu and Jannie Møller-Hartley

In the autumn of 2018, European email inboxes overflowed with ‘your data is safe’ messages from all sorts of obscure companies and businesses that we had once engaged with. This was a direct consequence of the GDPR implemented in the European Union. As researchers, we were puzzled about how citizens might respond to such messages, and what it meant that citizens in their daily lives were entangled with platforms and apps that collect, analyse, and make predictions based on their data. This entanglement arises both as we move around the city using transportation and as we engage with teachers in schools or with our doctors via email consultation. Denmark, the context of this study, forms an interesting case because of the country’s extremely high internet penetration, high degree of digitalisation, and high trust in public authorities.

We asked ourselves: how is trust in datafied media negotiated in our datafied lives? What do the use of media, our dependency on them, and our entanglement with them mean for such negotiations? Our society relies heavily on media, and hence citizens’ perceptions of data must be understood through the prism of their relationship with media. Additionally, we wondered how the implementation of GDPR – a framework meant to protect citizens’ data – might affect the negotiation of our datafied everyday lives.

We decided that the questions were so complex that we needed to engage qualitatively with all the nuances in citizens’ daily negotiations – negotiations they might not even be aware they are having, or that they might not recall in an interview situation. We opted for focus groups, having different groups of citizens discuss these issues, and hoped that some discussion prompts would bring some of these inner negotiations to the surface. During a cold and dark winter in Roskilde, we brought together people of different occupations and ages to discuss with us. We had them map all the platforms and apps they encounter and use on a daily basis, and discuss how much they trust each of them and what rationales they use for doing so. Our method helped participants make explicit what we in the analysis call ‘heuristics of trust’.

We saw five sets of heuristics guiding the trust assessments of citizens: 1) characteristics of media organisations, 2) old media standards, 3) context of use and purpose, 4) experiences of datafication and 5) understandings of datafication. Interestingly, the participants used their previous experiences to guide their understanding of how they could respond to the chilling effects or anxieties they were experiencing with having their data collected and their media use monitored. They considered what risk is associated with which type of data. In practice, this meant that they did not mind giving up data about their shoe-size, film preferences or what they have recently bought on Amazon, but were naturally more worried about data on their health and their financial situation. 

All in all, citizens are guided by partial, embedded ‘structures of perception’ and are enticed into trusting datafied media in the context of their everyday lives. They may be highly concerned by the datafication of the media they use, but they use them heavily nevertheless. In fact, they recognised that some of the media they use the most are also the ones they trust the least with their data. Rationally, we would expect that a lack of trust should lead to an adjustment in their use of media, but our data show that this is not the case at all. Their assessment of trust is an ongoing and practical negotiation that is weighed against the benefits they get from media and the entanglement of their everyday life in media. We also found that trust is not necessarily improved by users having to read pages and pages of consent forms, cookie declarations, etc. Guided by their immediate need of use, adapting their previous understandings of datafication (as knowledge increases) on an ongoing basis, and positioning themselves towards an urge to consider data privacy seriously, citizens’ trust assessment is a complex process affected by many things other than ‘just’ regulation or rational behaviour. This is a paradox, which regulators, businesses, and scholars should recognise in the next generation of GDPR regulative measures.


Monday, 20 December 2021

The datafication revolution in criminal justice: An empirical exploration of frames portraying data-driven technologies for crime prevention and control

by Anita Lavorgna (@anitalavorgna) and Pamela Ugwudike (@PamelaUgwudike)


The article we recently published presents and discusses the findings of a review of multidisciplinary academic abstracts on the data-driven algorithms shaping (or with the clear potential to shape) decision making across several criminal justice systems. Over the years, increased attention has been paid to the possibilities for big data analytics to inform law enforcement investigations, sentence severity, parole decisions, criminal justice resource allocation, and other key decisions. In some jurisdictions, data-driven algorithms are already deployed in these high-stakes and sensitive contexts, with major real-life implications. While their use can potentially enhance the efficiency of certain routines and time-consuming tasks, they can also give rise to adverse outcomes (notably racial and other types of systemic bias, along with other ethical concerns). Given the complexity of human behaviour, and other problems such as the reliance of some algorithms on flawed administrative data, algorithmic outputs (e.g., risk predictions) can be spurious. But key stakeholders such as the justice systems that deploy them and the general public might overestimate the reliability of these results, as they might not fully understand the complex mechanisms behind them.


While working on our research on cybercrimes and cyber harms (Lavorgna) and on conduits of bias in predictive algorithms deployed in justice systems (Ugwudike), two inherently interdisciplinary research areas that allow us to read across a wide range of publications from several disciplines, we noticed how the portrayal of data-driven tools differed markedly depending on the disciplinary take. Similarly, when reading declarations by policy makers, the security industry, and the creators, vendors, and other proponents of the algorithms, we could not avoid noticing a certain hype around the effectiveness of these tools, in contrast with our own research experience. Intrigued by this puzzle, we decided to carry out some research and move beyond our anecdotal experience.


In our contribution, we now propose a typology of frames for understanding how relevant technologies are portrayed, and we elucidate how notions of sociotechnical imaginaries and access to digital capital are of the utmost importance in explaining differences in how the value and impact of the technologies are framed. We hope that our work can help further critical debates not only on algorithmic harms, but also on the importance of truly interdisciplinary research to facilitate the inclusion of a broader range of perspectives in our current, and unavoidable, datafication challenge.

Tuesday, 14 December 2021

BD&S Dec/Jan break

The editorial team of the journal Big Data & Society will have a break from December 19th to January 8th.
Please excuse any delays in processing and reviewing your submission, and in related correspondence, during that time. Thank you!

Happy Holidays!

Saturday, 11 December 2021

New book: Data Practices - Making up a European People

A new book by Founding Editor and Editor-in-Chief Evelyn Ruppert, co-edited with Stephan Scheel, has been published by Goldsmiths Press. The book, Data Practices: Making Up a European People, is open access, and a PDF can be downloaded from Goldsmiths Press.

The book includes contributions from an interdisciplinary team of social science researchers, Baki Cakici, Francisca Grommé, Stephan Scheel, Ville Takala, and Funda Ustek-Spilda, who were part of a five-year ERC-funded project led by Evelyn Ruppert, Peopling Europe: How Data Make a People (ARITHMUS – Peopling Europe). The book develops a conception of data practices to analyze and interpret findings from their collaborative ethnographic multisite fieldwork at national and international statistical organisations across Europe. Drawing on theories that are part of what is more generally referred to as the ‘practice turn’ in contemporary social sciences, the book develops a conception of data practices that speaks to theoretical, political and practical issues taken up in many articles published in Big Data & Society. Specifically, it approaches data practices not simply as reflecting populations but as performative in two senses: they simultaneously enact (that is, “make up”) a European population and, by so doing (intentionally or otherwise), also contribute to making up a European people.

Friday, 29 October 2021

Editorial Board Update

We are very happy to announce the renewal of the Editorial Board for Big Data & Society. We take this step every three years to ensure that we refresh and expand the range of researchers engaged in the journal.

We are looking forward to the engaging and productive interactions that are to come.

We also wish to thank everyone who has previously served on the Editorial Board for helping make the journal the success that it is today. 

Thursday, 21 October 2021

A Guest Speaker Series about the Covid-19 Infodemic phenomenon by the Social Media Lab at Ryerson University

In the new Big Data & Society Special Theme “Studying the COVID-19 Infodemic at Scale”, we describe the rapid circulation of both accurate information and misinformation related to Covid-19 and other health issues as an infodemic. Our editors, Anatoliy Gruzd, Manlio De Domenico, Pier Luigi Sacco, and Sylvie Briand, offer cutting-edge approaches to studying the effects, strategies and challenges associated with the emergent infodemic. They have sought to open up a space for investigating and discussing this phenomenon by selecting six research articles and four commentaries from six countries (see more details here).

Our editors have extended the discussion of the Covid-19 infodemic by initiating a Guest Speaker Series with the Social Media Lab at the Ted Rogers School of Management at Ryerson University from Oct 7th to Nov 19th. The talks are: 

  • Oct 7 (Thu), 2 pm. Kai-Cheng Yang, Observatory on Social Media, Indiana University. The COVID-19 Infodemic: Twitter versus Facebook
  • Oct 21 (Thu), 2 pm. Mark Green, Department of Geography & Planning, University of Liverpool. Identifying how COVID-19-related misinformation reacts to the announcement of the UK national lockdown: An interrupted time-series study
  • Oct 28 (Thu), 2 pm. Jon Roozenbeek, Department of Psychology, University of Cambridge. Towards psychological herd immunity: Cross-cultural evidence for two prebunking interventions against COVID-19 misinformation
  • Nov 4 (Thu), 2 pm. Kacper Gradon, Department of Security and Crime Science, University College London. Countering misinformation: A multidisciplinary approach
  • Nov 11 (Thu), 2 pm. Michael Robert Haupt, Department of Cognitive Science, University of California San Diego. Identifying and characterizing scientific authority-related misinformation discourse about hydroxychloroquine on Twitter using unsupervised machine learning
  • Nov 19 (Fri), 2 pm. Paola Pascual-Ferrá, Department of Communication, Loyola University Maryland. Toxicity and verbal aggression on social media: Polarized discourse on wearing face masks during the COVID-19 pandemic

If you wish to attend this event in real-time, please register here in order to receive a Google Meet access link to each talk. All talks are recorded and available to access (after the real-time event) via the same webpage.

Everyone is welcome to join the Guest Talk Series. The guest talks will focus on the questions:

1) What are the benefits and challenges of detecting and combating the spread of Covid-19 misinformation on social media?

2) How can the digital method of data-tracing function as a useful methodology to examine and mitigate the risks that an infodemic could pose to individuals and society?