Big Data & Society: How biased is the sample? Reverse engineering the ranking algorithm of Facebook’s Graph application programming interface

Justin Chun-ting Ho
Big Data & Society 7(1), https://doi.org/10.1177/2053951720905874. First published February 17, 2020.
Keywords: Bias detection, data mining, Facebook pages, application programming interface, social media research

Since November 2017, Facebook has limited the maximum number of page posts retrievable through its Graph application programming interface (API). However, there is limited documentation on how these posts are selected (Facebook 2017). In this article, I assess the bias caused by the new limitation by comparing two datasets of the same Facebook page: a full dataset obtained before the introduction of the limitation and a partial dataset obtained after. To establish generalisability, I also replicate the findings with data from another Facebook page.
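The comparison design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the article's code: the record fields (`id`, `type`), the toy posts, and the helper names `type_shares` and `representation_ratio` are all hypothetical. The idea is to treat the partial dataset as a subset of the full one and compare the share of each post type in both.

```python
from collections import Counter

def type_shares(posts):
    """Return the share of each post type in a list of post records."""
    counts = Counter(p["type"] for p in posts)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def representation_ratio(full_posts, partial_posts):
    """Share in the partial dataset divided by share in the full dataset.

    A ratio above 1 suggests over-representation, below 1 under-representation.
    """
    full = type_shares(full_posts)
    partial = type_shares(partial_posts)
    return {t: partial.get(t, 0.0) / share for t, share in full.items()}

# Toy data (invented for illustration): the "full" dataset stands in for the
# pre-limitation crawl, the "partial" dataset for what the limited API returns.
full_posts = [
    {"id": "1", "type": "photo"},
    {"id": "2", "type": "link"},
    {"id": "3", "type": "link"},
    {"id": "4", "type": "video"},
]
selected_ids = {"1", "3", "4"}
partial_posts = [p for p in full_posts if p["id"] in selected_ids]

ratios = representation_ratio(full_posts, partial_posts)
for post_type, ratio in sorted(ratios.items()):
    print(f"{post_type}: {ratio:.2f}")
```

Matching posts by ID is what makes the comparison possible: every post in the full dataset can be labelled as selected or non-selected, which is the outcome variable any later analysis of the selection would need.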

This paper demonstrates that posts with high user engagement, as well as Photo and Video posts, are over-represented, while Link posts are under-represented. Top-term analysis reveals significant differences in the most prominent terms between the full and partial datasets. The paper also reverse-engineers the new API's ranking algorithm to identify the features of a post that affect its odds of being selected. The estimated model indicates that post type, Likes, Angry, Shares, and Likes on Comment are significant predictors. Sentiment analysis reveals significant differences in sentiment word usage between the selected and non-selected posts.
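To make "odds of being selected" concrete, here is a minimal sketch that fits a logistic regression on toy data. Several things are assumptions on my part: the summary above does not say which estimator the article uses, so logistic regression is a stand-in; the data are invented; only two predictors (log-transformed Likes and a Link-post indicator) are included; and the plain gradient-descent fitter is a teaching device rather than a recommended tool.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit a logistic regression by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            err = p - label
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Invented toy posts: (likes, is_link, selected_by_api)
posts = [
    (500, 0, 1), (300, 0, 1), (250, 0, 1), (40, 0, 1), (12, 0, 1), (15, 0, 0),
    (400, 1, 1), (200, 1, 0), (30, 1, 0), (20, 1, 0), (10, 1, 0),
]
X = [[math.log1p(likes), float(is_link)] for likes, is_link, _ in posts]
y = [selected for _, _, selected in posts]

weights, bias = fit_logistic(X, y)

# exp(coefficient) is the multiplicative change in the odds of selection
# for a one-unit increase in the predictor, holding the other fixed.
odds_ratios = {
    "log_likes": math.exp(weights[0]),
    "is_link": math.exp(weights[1]),
}
print(odds_ratios)
```

In this toy setup an odds ratio above 1 for `log_likes` and below 1 for `is_link` would mirror the direction of the findings summarised above, but the numbers themselves carry no empirical meaning.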

These findings have significant implications for research that uses Facebook page data collected after the introduction of the limitation:

The under-representation of Link posts means that a significant amount of link-sharing activity would become invisible through the API.
There is no evidence to support the common expectation that the API would rank posts based on the number of Likes and Comments. While the selected posts seem to have more Likes and Comments, other features also have an effect on the odds of being selected.
It is questionable to assume that the new API would return all the posts with the highest user engagement. Even though it is observed that the selected posts on average have higher user engagement, some highly commented and liked posts might not be selected due to the effect of other features.
Posts of certain linguistic styles can be filtered out, as the new API tends to return posts with more emotional text.
Non-random factors might be influencing the representation of the most prominent terms in the selected posts, which could lead to bias in text models.

However, it is important to note that the data retrieved from the Graph API is still a useful resource that enables a wide range of research methods. We should not stop using the data because of the above-mentioned issues. Echoing Rieder et al. (2015), the potential bias calls for caution, prudence, and critical attention when using and interpreting the data. Uncovering the bias of the ranking algorithm will help researchers to better support their research results.

References:

Facebook (2017) /page-id/feed. Available at: https://developers.facebook.com/docs/graph-api/reference/v2.11/page/feed (accessed 31 January 2020).

Rieder B, Abdulla R, Poell T, et al. (2015) Data critique and analytical opportunities for very large Facebook pages: Lessons learned from exploring “We are all Khaled Said”. Big Data & Society 2(2): 1–22. https://doi.org/10.1177/2053951715614980

Saturday, 7 March 2020
