Thursday, 21 May 2020

The virtue of simplicity: On machine learning models in algorithmic trading

Kristian Bondo Hansen, Copenhagen Business School
Big Data & Society 7(1), https://doi.org/10.1177/2053951720926558. First published: May 20, 2020.
Keywords: Ockham’s razor, machine learning models, algorithmic trading, distributed cognition, model overfitting, explainability

In the spring of 2018, as part of my fieldwork on the use of emerging technology in securities trading, I sat in on an industry conference on the employment of machine learning (ML) and artificial intelligence (AI) in finance. The conference took place in a rather lavish late-eighteenth-century building in central London, which had belonged to the freemasons before being turned into a hotel and then a conference venue. On the second day of the event, a labour union was having a gathering just one floor down from where I sat listening to old-school financiers and new-school data scientists talk about Markov chains, unstructured data, reinforcement learning, LSTMs, autoencoders, and a lot of other very technical stuff. I remember pondering what the union people might think about the “capitalists” upstairs and whether I was perceived as one of them. Thoughts about class affiliation and blending in aside, I was enjoying the conference, especially my coffee-break conversations with the new breed of tech-savvy financiers. In one of the more accessible and less hypothetical presentations (none of the participants were interested in sharing trade secrets, just glimpses of their potentially profitable ML and AI models), a young data wizard doing quantitative risk management at a Dutch clearing bank spoke about a late-stage ML model for anomaly detection in undisclosed financial data. Because it was the first presentation of a production-ready ML model and not just a thing on the drawing board, the room was buzzing with excitement. During the Q&A the presenter was queried about the number of tests and the type and scope of data the model had been trained on. Asked why the team from the clearing bank had performed only a limited number of tests of the model, the presenter replied: “because it works. And I have deadlines too!”

Besides being amusingly frank, the data scientist’s response captures the combination of, and tension between, pragmatism and tireless scientific rigour that characterises contemporary quantitative, model-driven trading and investment management. Though some algorithms are immensely sophisticated engineering marvels, it is, at the end of the day, their ability to consistently make money that counts. With data scientists rapidly replacing economists in the back, middle and front offices of trading firms, hedge funds and banks, finance seems more and more to be turning into an applied data science industry. While finding an edge in markets is now partly a scientific endeavour, the end goal remains the same: to make money. It is the challenge of devising robust, sophisticated and profitable yet understandable and thus manageable ML algorithms for trading and investment purposes that I explore in my paper ‘The virtue of simplicity: On machine learning models in algorithmic trading’. More specifically, I engage with the development of such models from the quants’ perspective and analyse their reflections on how to deal with ML techniques, vast datasets, and the dynamism of financial markets in an industry that is in many ways impatient. Drawing on distributed cognition theory, I argue that ML techniques enhance financiers’ ability to take advantage of opportunities, but at the same time carry a degree of unavoidable complexity that developers and users need to find ways to make sense of, manage and control.
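
To give a concrete, if stylised, sense of why that complexity needs managing, consider the overfitting problem that makes simplicity attractive in the first place: a needlessly flexible model fits the noise in its training data and tends to degrade on unseen data. The sketch below is purely illustrative and is not drawn from the paper or from any actual trading model.

```python
# Toy illustration of overfitting, the problem behind the 'virtue of simplicity'.
# Data, model choice and polynomial degrees are assumptions for demonstration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-1, 1, 40)).reshape(-1, 1)   # small, noisy sample
y = np.sin(3 * x).ravel() + rng.normal(0, 0.3, 40)   # 'signal' plus noise
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (3, 15):  # a simple model versus a needlessly flexible one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:>2}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# The high-degree fit typically shows a lower training error but a higher test
# error: the extra complexity buys memorised noise rather than predictive power.
```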

The paper shows how ML quants attempt to manage the complexity of their algorithms by resorting to simplicity as a virtuous and pragmatic rule of thumb in model development and implementation. Quants consider simplicity a heuristic that helps them manage and control machine learning model complexity; they are particularly fond of the Ockham’s razor principle, which says that things should not be multiplied without necessity. The argument for having simplicity as a rule of thumb in ML modelling is to ensure comprehensibility, interpretability and explainability. It helps frame the modelling process by making it more foreseeable and controllable, which fortifies accountability. Rather than being able to account for every little detail of a learning algorithm, what quants perceive as having an understanding or “feel” for a model is a matter of grasping the algorithm’s basic logic and being capable of interpreting its output. The study contributes to research on the relationship between, and interaction of, humans and algorithms in finance and beyond.

The research that went into this paper was carried out as part of the ERC-funded interdisciplinary research project ‘AlgoFinance’, which explores algorithm and model use in financial markets. Combining ethnographic field studies with large-scale agent-based simulations of securities markets, we try to understand how algorithms, machine learning and non-machine learning alike, construct and shape the interaction dynamics of market actors trading with one another. The research team consists of social scientists Christian Borch (PI), Daniel Souleles, Bo Hee Min and myself, and, from the hard sciences side of things, Zachery David, Nicholas Skar-Gislinge and Pankaj Kumar. In addition to the sociological network perspective on the interaction of trading algorithms, we examine the ways organisations and individuals (traders, portfolio managers, quants, etc.) are affected by, adapt to and try to stay on top of technological advances in the field. One of the things we hope will come of our efforts is a better understanding of the social dynamics underpinning and embedded in the thoroughly quantified and increasingly automated world of securities trading. A big part of shedding light on this social dimension of algorithmic finance is exploring the socio-material assemblages of humans and algorithms, which is exactly what I do in my paper.

Monday, 18 May 2020

Big Data and Surveillance: Hype, Commercial Logics and New Intimate Spheres

Editorial
Big Data & Society 7(1), https://doi.org/10.1177/2053951720925853. First published May 14, 2020.


Guest lead editors: Prof. Kirstie Ball*, Prof. William Webster**

* Centre for Research into Information, Surveillance and Privacy, University of St Andrews, School of Management
** Centre for Research into Information, Surveillance and Privacy, University of Stirling, School of Management

When viewed through a surveillance studies lens, Big Data is instantly problematic. In comparison with its predecessors, and by virtue of its pre-emptive impulses and intimate data flows, Big Data creates a more penetrating gaze into consumers’ and service users’ lives. Because Big Data draws on data streams from social and online media, as well as from personal devices designed to share data, consumers have limited opportunities to opt out of data sharing and difficulty finding out what happens to their data once it is shared. In the Big Data era, consumers and service users exert comparatively less control over their personal information flows, and their mundane consumption activities become highly significant and subject to scrutiny. Their subjection to the social sorting which results from the classification of those data is comparatively intensified and commercialised. Those companies which are in a position to exploit the value created by Big Data Analytics (BDA) enjoy powerful market positions.

Consequently, greater attention needs to be paid to the corporate and other actors which bring surveillance practices like BDA into being. BDA as practiced predominantly takes place in organizational settings. Addressing the mid-range of BDA - the mesh of organizations which mediate between the end consumer, the organisational and societal context, and the marketer of products - reveals how the influence and power of BDA are far from a done deal. The commercial logics which drive BDA implementation are seated in promises of seamless improvements in operational efficiency and more accurate decision-making arising directly from the use of analytics. As a marketing practice, for example, BDA seeks to create value from an extensive array of new data-generating sources used by consumers. The aim is to produce new insight into consumer behaviours so that consumers can be better targeted by marketers in real time and their intentions predicted with a greater degree of accuracy. However, the realisation of this ‘value’ is highly contingent. Personnel management, technology infrastructure, organizational culture, skills, and management capability are all identified as crucial components that shape the value generated. The sheer socio-technical range and interdependency of these internal variables highlight the two issues with which this special themed issue of Big Data and Society is concerned.

The first concerns the power relations and political dynamics of BDA implementation. Adopting, enacting and complying with the demands of BDA strategies involves a rethinking of roles, relationships and identities on the part of those involved in the transformation. Significant pressure and hype have been brought to bear on non-technical organizational constituencies, such as marketers, who have been challenged by the implications of BDA and are required to reconcile their creative, qualitative approaches with an analytical world-view. Similarly, in a public service context, managers are increasingly required to base their policy and operational decisions on new information flows embedded in BDA. They are finding that these novel, technologically intensive processes conflict with long-established norms and organisational decision-making structures.

The second concerns how practices associated with BDA extend surveillance into the intimate sphere. The surveillance practices embedded in Big Data are typically seen as commercial processes and another facet of operational efficiency and value creation. Such data can be subtle, highly nuanced and very personal. It can be collected from the home and can include information gathered within intimate, domestic spaces. Ethical concerns are recognised by practitioners, although they are still couched within a value discourse - and a robust ethics committee can ‘allow’ and ‘oversee’ the collection of such data.

Big Data succeeds in extending the scope of surveillance by co-opting individuals into the de facto surveillance of their own private lives. Through the increasingly embedded role of online social networks and location-sensitive mobile devices in social activities, the boundaries between surveillance and the surveilled subject become blurred. Big Data breaks down boundaries between different sources of data, thus allowing the combination of information from different social domains. In democracies, with clearer legal protections of the line between public and private, Big Data extends existing surveillance technologies in its ability to co-opt the key economic actors - the corporations - and thus gain a window into private lives. Big Data practices also give powerful commercial corporations greater access to the machinery of government and public services: they are increasingly influential in policy-making and service delivery, and they gain greater access to data deriving from these organisational entities. Levels of ubiquity in data collection previously available only in tightly controlled political environments are therefore now available universally.

A brief guide to the special theme
This theme features six articles, all of which contextualise Big Data hype within, and at times counter to, business and organisational logics. They explore how BDA extends surveillance across more intimate boundaries, highlighting the emotional registers of consumers; home automation and household surveillance; and the surveillance and commercialisation of children via ‘Hello Barbie’. They also examine how Big Data practices are produced, reflecting the argument that the enactment of surveillant power using BDA is not a certainty but a negotiated organisational process. This theme addresses a gap in critical scholarship on Big Data, as it explores the links between Big Data, its organisational and commercial contexts and increasing levels of intimate surveillance. The articles illustrate how business and organisational practices shape and are shaped by BDA, and how the producers and consumers of Big Data are forging new intimate and intensive surveillance relationships. BDA is not as revolutionary as its more vocal advocates sometimes suggest. Its implementation and use are embedded within, and shaped by, powerful institutional norms and processes - and, seen in retrospect, the development of BDA is clearly an incremental, path-dependent process.

Thursday, 14 May 2020

Playing with machines: Using machine learning to understand automated copyright enforcement at scale

Joanne Gray, Nicolas Suzor
Big Data & Society 7(1), https://doi.org/10.1177/2053951720919963. First published April 28, 2020.
Keywords: machine learning, copyright enforcement, YouTube, content moderation, automated decision-making, Content ID


How can we understand how massive content moderation systems work? Major social media platforms use a combination of human and automated processes to efficiently evaluate the content that their users post against the rules of the platform and applicable laws. These sociotechnical systems are notoriously difficult to understand and hold to account.

In this study, we use digital methods to try to make the content moderation system on YouTube—a system that relies on both automated and discretionary decision-making and that is applied to varying types of video content—more legible for research.

Starting with a random sample of 76.7 million YouTube videos, we used the BERT language model to train a machine learning classifier to identify videos in categories that reflect ongoing controversies in copyright takedowns. Our four categories were full movies, gameplay, sports content, and ‘hacks’ (tutorials on copy control circumvention). We used this classifier to isolate and categorise a sample of approximately 13 million unique videos for further analysis.
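
As a rough illustration, the sketch below shows how a BERT-based classifier of this kind could be set up with the Hugging Face transformers library. The model name, label set and metadata fields are assumptions for demonstration rather than our actual pipeline, and the classification head would first need to be fine-tuned on hand-labelled examples.

```python
# Illustrative sketch only: a BERT-based classifier for sorting video metadata
# into content categories. Model name, labels and input fields are assumptions;
# in practice the classification head is fine-tuned on hand-labelled examples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["full_movie", "gameplay", "sports", "hacks", "other"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)
model.eval()

def classify_video(title: str, description: str) -> str:
    """Predict a content category from a video's title and description."""
    inputs = tokenizer(f"{title} {description}", truncation=True,
                       max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(classify_video("FULL MOVIE (2018) HD", "Watch the complete film online"))
```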

For each of these videos, we were able to identify which videos had been removed, what reasons YouTube gave to explain their removal and, in some cases, who had requested the removal. We sought to compare trends in copyright takedowns, Content ID blocks, and terms of service removals across different categories of content. Across our entire dataset, videos were most frequently removed from YouTube by users themselves, followed by removals due to an account termination and then Content ID blocks. DMCA takedowns were the least common removal type.
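
A minimal sketch of that comparison, assuming one row per video and hypothetical ‘category’ and ‘removal_reason’ columns (not our actual data structure), might look like this:

```python
# Hypothetical sketch: cross-tabulate removal reasons by content category and
# express them as within-category proportions. File and column names are assumptions.
import pandas as pd

videos = pd.read_csv("video_sample.csv")  # one row per video (hypothetical file)

counts = pd.crosstab(videos["category"], videos["removal_reason"])
rates = counts.div(counts.sum(axis=1), axis=0)  # share of each removal type per category

print(rates.round(3))
```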

One of the most important findings of this study is that we can see, at a large scale, the rates at which Content ID is used to remove content from YouTube. Previous large-scale studies have provided important insights into rates of DMCA removals but information about Content ID removals has remained imprecise, provided at a high level of abstraction by YouTube. In this paper, we provide the first large-scale systematic analysis of Content ID and Terms of Service removal rates, including comparisons with other removal types and across different categories of content.

Our analysis provides a comparison of different types of automation in content moderation, and of how these different systems play out across different categories of content. We found very high rates of removals for videos associated with film piracy, and this category had the highest rate of removals for terms of service violations and account terminations. By contrast, we found low rates of removals for gameplay. It appears from our data that game publishers are largely not enforcing their rights against gameplay streams, and that when a gameplay video is removed it is usually due to a claim by a music rightsholder.

For sports content, we found very high rates of removals for both live sporting streams and sports highlights, including high rates of terms of service violations. For this category, it appears that both copyright owners and YouTube are highly active in policing content. For the ‘hacks’ category, we found high rates of removals, but mostly for terms of service violations, indicating that YouTube, rather than rightsholders, is more commonly taking action to remove content and terminate accounts that provide DRM circumvention information.

Overall, we found that there is substantial discretionary decision-making within YouTube’s heavily automated content moderation system. We welcome the capacity for the use of discretion in automated systems, but it matters who is making the decisions and why. As nations pressure platforms to take a more active role in moderating content across the internet, we must continue to advance our methods for holding decision-makers to account.

This experimental methodology has left us optimistic about the potential for researchers to use machine learning classifiers to better understand systems of algorithmic governance in operation at a large scale and over a sustained period.

Wednesday, 25 March 2020

Video abstract: Beyond algorithmic reformism

Peter Polack introduces his paper "Beyond algorithmic reformism: Forward engineering the designs of algorithmic systems" in Big Data & Society 7(1), https://doi.org/10.1177/2053951720913064. First published: March 20, 2020.

Video abstract

Text abstract
This article develops a method for investigating the consequences of algorithmic systems according to the documents that specify their design constraints. As opposed to reverse engineering algorithms to identify how their logic operates, the article proposes to design or "forward engineer" algorithmic systems in order to theorize how their consequences are informed by design constraints: the specific problems, use cases, and presuppositions that they respond to. This demands a departure from algorithmic reformism, which responds to concerns about the consequences of algorithmic systems by proposing to make algorithms more transparent or less biased. Instead, by investigating algorithmic systems according to documents that specify their design constraints, we identify how the consequences of algorithms are presupposed by the problems that they propose to solve, the types of solutions that they enlist to solve these problems, and the systems of authority that these solutions depend on. To accomplish this, the article develops a methodological framework for researching the process of designing algorithmic systems. In doing so, it proposes to move beyond reforming the technical implementation details of algorithms in order to address the design problems and constraints that underlie them.

Keywords: Critical algorithm studies, predictive policing, design studies, algorithmic bias, algorithmic opacity, algorithmic accountability

Tuesday, 24 March 2020

Changes to Big Data and Society’s Review Policy due to COVID-19.


Hello to all readers, authors and reviewers. Given the challenges around COVID-19 facing many of us, it makes little sense to continue with our normal review process. So, beginning today, Big Data and Society is starting a review hiatus for the next four weeks.

This means that: (1) we will pause asking for any new reviews until April 19th; (2) for papers currently in review, we will extend the deadline for currently assigned referee reports by four weeks; (3) we will extend the submission deadline of papers in revision by four weeks; (4) we will still accept papers during this time but will not start the review process until April 19th; and (5) decisions on our recent call for special themes will also be pushed back until the end of April.

We recognize that these changes are toughest on authors (particularly those facing tenure and promotion decisions) and are prepared to address the specific circumstances they face. We will re-assess things shortly before April 19th and adjust plans accordingly. If you have questions, please contact Matthew Zook (Managing Editor), zook@uky.edu.

We wish everyone and their loved ones health and safety in the upcoming weeks. Stay well.

Editorial Team of the Big Data and Society journal


Monday, 23 March 2020

What’s the harm in categorisation? Reflections on the categorisation work of Tech 4 Good


by Kate Sim and Margie Cheesman, Oxford Internet Institute

The UN Special Rapporteur Tendayi Achiume will soon release a report for the UN Human Rights Council on the role of digital technologies in facilitating contemporary forms of discrimination and inequality [FN1]. The report will be informed by an expert workshop which took place at UCLA in October last year and brought together academics and activists at the forefront of analysing the social, political, and ethical implications of data-driven technologies. The workshop aimed to assess the extent to which digital technologies not only reproduce but uniquely exacerbate existing power structures. Two points identified by the workshop participants stood out: (1) the process of categorisation that underpins the design, implementation, and internal logics of digital technologies is a distinctive element of these systems, and (2) the speed and scale of technological categorisation work reproduce and compound social stratifications.

With this post, we reflect on the workshop discussions and seek to complement the forthcoming report by examining the role of categorisation, drawing on our respective field research on ‘Tech 4 Good’. Under the banner of ‘Tech 4 Good’, technology companies, policymakers, and civil society organizations come together in their desire to use data-driven technologies as efficient, apposite, and meaningful interventions into social concerns. But how exactly, and for whom, is technology ‘good’? Investigating the design and deployment of digital technologies can illuminate how these systems codify and stabilize power relations. Technological categorisation underpins the interpretive process through which affected individuals and society are codified. For example, our respective fieldwork on sexual misconduct reporting and humanitarian aid shows how the category-making work of data-driven systems maps onto and extends power structures. This is also evident in the contemporary vocabularies and frameworks for assessing the impact of digital technologies. We highlight how key stakeholders centre critiques around privacy-related harms, reducing the insidious problems of tracking, profiling and categorisation to a matter of individual choice rather than a systemic issue. In short, this post shows why digital categorisation practices - nowhere more so than in Tech 4 Good - demand greater public critique, accountability and reparative justice.

Notes from the field

Our respective work at the Oxford Internet Institute demonstrates how categorisation is a key part of marshalling digital technologies for ‘social good.’ For example, Sim’s research highlights how categorising violence is integral to the production and application of sexual misconduct reporting systems. Case management systems like Callisto, LiveSafe, and Vault Platform facilitate the creation of sexual misconduct reports that trigger institutional responses. The perception that sexual assault data is fixed and objective helps make these reporting tools seem accurate and trustworthy. However, gender assumptions are constructed and reinforced in the way that data is encoded in the system design. For example, users are asked to define their experience by selecting a category of violence from a prefigured list that includes categories like ‘sexual harassment’ and ‘verbal abuse.’ The implicit assumption here is that these categories are self-evident and well-defined. Yet analysing users’ experiences of searching for the ‘right’ category demonstrates how these reporting tools’ categories, as a constructed system of defining and sorting, fall short of capturing the range of users’ experiences of sexual misconduct. “The body,” as Bowker and Star (2000) write, “cannot be aligned with the classification system” (p.190).

The impact of these categories extends beyond moments of disconnect on the user’s end. The assumptions with which system vendors construct and guide these categories lend greater credibility to those users who can successfully complete the form. The system vendors appeal to data’s perceived objectivity and incorruptibility, which sociologist David Beer terms the ‘data imaginary’ (2017), to ascribe greater credibility to those users’ claims. Those who are able to conform to the system’s classificatory logic are rewarded as credible reporters, while those who cannot conform see their experiences marginalised.

Cheesman’s research in the aid industry demonstrates how categorisation is the basis of resource allocation. Aid is distributed according to metrics of refugee vulnerability, including bodily abilities, sexual habits, food consumption and credit scores. New data technologies used in humanitarian categorisation practices structure the conditions of recognition and support for refugees. Yet the same technologies are routinely depoliticised as neutral means of better targeting and delivering services: blockchain’s automated consensus algorithms are presented as facilitating ‘neutral’ information-sharing between aid organisations, for example, and biometric identity checks are understood as an efficient, anti-fraud advancement in humanitarian bureaucracy. Veracity is ascribed to both of these technologies: biological information is seen as accurate and non-contentious, and blockchain as constructing unbiased, incorruptible, real-time records.

As a result, these de-politicised approaches to humanitarian digitalisation stabilise categorisation practices as objective, eclipsing intersectional issues about category-making, discrimination and mobility control. Digital identification can adversely affect certain groups by fixing them to a unitary set of categories: infrastructures designed to provide legal identity and protection are used to denationalise and repatriate stateless people, as with the Rohingya in Myanmar and Bangladesh (Taylor & Mukiri-Smith 2019; Madianou 2019). Moreover, categories are not just political, but gendered and racialised - for example in ethnic groupings and identifications of ‘head of household’ in refugee camps. In all these cases, there is little capacity for refugee populations to resist, contest, or define their own subjectivities.

Critiques of digital humanitarianism often centre on individual technologies and individual privacy rather than on systemic issues of categorisation. But evidence of protest against biometrics or smartcards (Al Jazeera 2018), of privacy breaches (Cornish 2017) or of physically dangerous system avoidance (Latonero & Kift 2018: 7) should not be the only point at which computational logics are questioned. These logics normalise power relations, and maintain and extend structures of North-South domination by the aid industry. Paternalistic approaches are both produced and reinforced by the exclusion of refugees from humanitarian categorisation and decision-making.

What’s the harm in categorisation?

The speed, scale, and scope of technological categorisation not only reproduces social stratifications and injustices; it also makes those asymmetries less visible. Workshop attendees highlighted how the approach favoured by technology companies, journalists and policymakers, which primarily recognises the most tangible or immediate harms of digital technologies, overlooks and obfuscates the more insidious practices and impacts of categorisation. The focus on harm is rooted in current debates that individualise digital inequality in terms of personal choice and ethics. Violations of privacy continue to be the dominant metric through which stakeholders conceptualize the erosion of bodily and digital rights. Increasingly, the institutionalization of ethics as a strategy of self-regulation rather than public accountability illustrates the technology sector’s continuing hold over our collective language and imagined interventions (Metcalf, Moss & boyd 2019).

In response to these conceptual deficits, workshop attendees called for a reorientation of critical inquiry and intervention towards reparative justice. Beyond demanding more systematic mechanisms of transparency and accountability around digital categorisation practices, they asked: what are data subjects owed, who should be held responsible, and how? Proposed solutions to digital inequality include new models of ‘self-sovereign’ data ownership, control and monetisation for individuals (Wang & De Filippi 2020), as well as more collectivist ‘digital commons’ approaches (Prainsack 2019). However, if and how these could remedy asymmetries in categorisation power without producing exclusionary effects remains to be seen (ibid). As the UN 2020 report will also highlight, researchers, technology actors, activists, and policymakers must work in concert to develop more robust heuristics and frameworks for apprehending and addressing the structural nature of digital inequality.

----
[FN1] This report will be titled: 2020 Human Rights Council Report of the United Nations Special Rapporteur on Contemporary Forms of Racism, Racial Discrimination, Xenophobia and Related Intolerance.



Works Cited

Al Jazeera (2018) Violence stalks UN’s identity card scheme in Rohingya camps. Available at: https://www.aljazeera.com/news/2018/11/violence-stalks-identity-card-scheme-rohingya-camps-181122075307535.html

Beer, D. (2016) Metric Power. Palgrave Macmillan.

Beer, D. (2017) The data analytics industry and the promises of real-time knowing: Perpetuating and deploying a rationality of speed. Journal of Cultural Economy 10(1): 21–33. https://doi.org/10.1080/17530350.2016.1230771

Bowker, G.C. and Star, S.L. (2000) Sorting Things Out: Classification and Its Consequences. Cambridge, MA: MIT Press.

Cornish, L. (2017) New security concerns raised for RedRose digital payment systems. Devex. Available at: https://www.devex.com/news/new-security-concerns-raised-for-redrose-digital-payment-systems-91619

Latonero, M. and Kift, P. (2018) On digital passages and borders: Refugees and the new infrastructure for movement and control. Social Media + Society 4(1).

Madianou, M. (2019) Technocolonialism: Theorizing digital innovation and data practices in humanitarian response. Social Media + Society: 1–33. https://doi.org/10.1177/2056305119863146

Metcalf, J., Moss, E. and boyd, d. (2019) Owning ethics: Corporate logics, Silicon Valley, and the institutionalization of ethics. Social Research: An International Quarterly 86(2): 449–476.

Prainsack, B. (2019) Logged out: Ownership, exclusion and public value in the digital data and information commons. Big Data & Society 6(1): 1–15. https://doi.org/10.1177/2053951719829773

Taylor, L. and Mukiri-Smith, H. (2019) Global data justice: Framing the (mis)fit between statelessness and technology. European Network on Statelessness. Available at: https://www.statelessness.eu/blog/global-data-justice-framing-misfit-between-statelessness-and-technology

Wang, F. and De Filippi, P. (2020) Self-sovereign identity in a globalized world: Credentials-based identity systems as a driver for economic inclusion. Frontiers in Blockchain 2: 1–22. https://doi.org/10.3389/fbloc.2019.00028

Saturday, 7 March 2020

How biased is the sample? Reverse engineering the ranking algorithm of Facebook’s Graph application programming interface

Justin Chun-ting Ho
Big Data & Society 7(1), https://doi.org/10.1177/2053951720905874. First published February 17, 2020.
Keywords: Bias detection, data mining, Facebook pages, application programming interface, social media research

In November 2017, Facebook introduced a new limitation on the maximum number of page posts retrievable through its Graph application programming interface (API). However, there is limited documentation on how these posts are selected (Facebook 2017). In this article, I assess the bias caused by the new limitation by comparing two datasets from the same Facebook page: a full dataset obtained before the introduction of the limitation and a partial dataset obtained after it. To establish generalisability, I also replicate the findings with data from another Facebook page.

This paper demonstrates that posts with high user engagement are over-represented, as are Photo and Video posts, while Link posts are under-represented. Top-term analysis reveals significant differences in the most prominent terms between the full and partial datasets. The paper also reverse-engineers the new API’s ranking algorithm to identify the features of a post that affect its odds of being selected by the new API. The estimated model posits that post type, Likes, Angry reactions, Shares, and Likes on Comments are significant predictors. Sentiment analysis reveals significant differences in sentiment word usage between the selected and non-selected posts.
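
As a rough sketch of what this reverse-engineering looks like in practice, the snippet below fits a logistic regression on a hypothetical merged table in which each post from the full dataset is flagged as selected or not by the new API; the file and column names are illustrative rather than the paper’s actual data.

```python
# Illustrative sketch of reverse-engineering the API's selection behaviour with a
# logistic regression. File and column names are assumptions; 'selected' marks
# whether a post from the full dataset also appears in the partial dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

posts = pd.read_csv("page_posts_merged.csv")  # hypothetical merged dataset

model = smf.logit(
    "selected ~ C(post_type) + likes + angry + shares + comment_likes",
    data=posts,
).fit()

print(model.summary())       # coefficients and significance
print(np.exp(model.params))  # odds ratios: each feature's effect on the odds of selection
```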

These findings have significant implications for research that use Facebook page data collected after the introduction of the limitation:
  • The under-representation of Link posts means that a significant amount of link-sharing activity becomes invisible through the API.
  • There is no evidence to support the common expectation that the API ranks posts based on the number of Likes and Comments. While the selected posts tend to have more Likes and Comments, other features also affect the odds of selection.
  • It is questionable to assume that the new API returns all the posts with the highest user engagement. Even though the selected posts have higher user engagement on average, some highly commented and liked posts might not be selected due to the effect of other features.
  • Posts of certain linguistic styles can be filtered out, as the new API tends to return posts with more emotional text.
  • Non-random factors might influence the representation of the most prominent terms in the selected posts, which could lead to bias in text models.
However, it is important to note that the data retrieved from the Graph API is still a useful resource that enables a wide range of research methods. We should not stop using the data because of the above-mentioned issues. Echoing Rieder et al. (2015), the potential bias calls for caution, prudence, and critical attention when using and interpreting the data. Uncovering the bias of the ranking algorithm will help researchers to better support their research results.

References:

Facebook (2017) /page-id/feed. Available at: https://developers.facebook.com/docs/graph-api/reference/v2.11/page/feed (accessed 31 January 2020).

Rieder B, Abdulla R, Poell T, et al. (2015) Data critique and analytical opportunities for very large Facebook pages: Lessons learned from exploring “We are all Khaled Said”. Big Data & Society 2(2): 1–22. https://doi.org/10.1177/2053951715614980