Joanne Gray, Nicolas Suzor
Big Data & Society 7(1), https://doi.org/10.1177/2053951720919963. First published April 28, 2020.
Keywords: machine learning, copyright enforcement, YouTube, content moderation, automated decision-making, Content ID
How can we understand how massive content moderation systems work? Major social media platforms use a combination of human and automated processes to efficiently evaluate the content that their users post against the rules of the platform and applicable laws. These sociotechnical systems are notoriously difficult to understand and hold to account.
In this study, we use digital methods to try to make YouTube’s content moderation system more legible for research. This system relies on both automated and discretionary decision-making and is applied to many different types of video content.
Starting with a random sample of 76.7 million YouTube videos, we used the BERT language model to train a machine learning classifier to identify videos in categories that reflect ongoing controversies in copyright takedowns. Our four categories were full movies, gameplay, sports content, and ‘hacks’ (tutorials on copy control circumvention). We used this classifier to isolate and categorise a sample of approximately 13 million unique videos for further analysis.
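To give a concrete sense of this step, the sketch below fine-tunes a BERT model on video titles using the Hugging Face transformers and datasets libraries. The file names, column names, and hyperparameters are illustrative assumptions, not a description of our exact pipeline.

```python
# Illustrative sketch: fine-tune BERT to classify videos by title.
# File names, columns, and hyperparameters are assumptions, not the
# exact pipeline used in the study.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

LABELS = ["movie", "gameplay", "sports", "hacks", "other"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

# Hypothetical CSV files with columns: title (str), label (int index)
data = load_dataset("csv", data_files={"train": "train.csv",
                                       "test": "test.csv"})

def tokenize(batch):
    # Video titles are short, so a small max_length keeps training cheap.
    return tokenizer(batch["title"], truncation=True,
                     padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="video-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```

A classifier of this kind can then be run over the titles of the full sample to assign each video to one of the categories of interest.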
For each of these videos, we identified whether it had been removed, what reason YouTube gave to explain its removal and, in some cases, who had requested the removal. We sought to compare trends in copyright takedowns, Content ID blocks, and Terms of Service removals across different categories of content. Across our entire dataset, videos were most frequently removed from YouTube by the uploaders themselves, followed by removals due to account termination and then Content ID blocks. DMCA takedowns were the least common removal type.
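One way to collect this removal information, sketched below, is to request each video’s public watch page and pattern-match the notice YouTube displays when a video is unavailable. The message strings here are simplified assumptions; YouTube’s actual wording varies by removal type and changes over time.

```python
# Illustrative sketch: probe a video's watch page and pattern-match
# the removal notice. The message strings are assumptions, not
# YouTube's exact wording.
import requests

REMOVAL_PATTERNS = {
    "dmca": "removed due to a copyright claim",
    "content_id": "blocked it on copyright grounds",
    "terms_of_service": "violating YouTube's Terms of Service",
    "account_terminated": "account associated with this video has been terminated",
    "user_removed": "removed by the uploader",
}

def removal_status(video_id: str) -> str:
    """Return a best-guess removal category for a video, or 'available'."""
    html = requests.get(
        f"https://www.youtube.com/watch?v={video_id}", timeout=10).text
    for category, pattern in REMOVAL_PATTERNS.items():
        if pattern in html:
            return category
    return "available"

print(removal_status("dQw4w9WgXcQ"))  # expected: 'available'
```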
One of the most important findings of this study is that we can see, at a large scale, the rates at which Content ID is used to remove content from YouTube. Previous large-scale studies have provided important insights into rates of DMCA removals, but information about Content ID removals has remained imprecise, provided by YouTube only at a high level of abstraction. In this paper, we provide the first large-scale systematic analysis of Content ID and Terms of Service removal rates, including comparisons with other removal types and across different categories of content.
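Once each video carries a category label and a removal status, comparing removal rates reduces to a simple cross-tabulation. The pandas sketch below illustrates that aggregation on toy data; the column names and values are assumptions.

```python
# Illustrative aggregation: cross-tabulate removal status by content
# category. The toy data and column names are assumptions.
import pandas as pd

videos = pd.DataFrame({
    "category": ["movie", "gameplay", "sports", "hacks", "movie"],
    "status": ["content_id", "available", "terms_of_service",
               "terms_of_service", "account_terminated"],
})

# Share of each removal type within each category.
rates = pd.crosstab(videos["category"], videos["status"], normalize="index")
print(rates.round(2))
```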
Our analysis compares different types of automation in content moderation and how these systems play out across different categories of content. We found very high rates of removals for videos associated with film piracy, and this category had the highest rates of removal for Terms of Service violations and account terminations. By contrast, we found low rates of removals for gameplay. Our data suggest that game publishers are largely not enforcing their rights against gameplay streams; when a gameplay video is removed, it is usually due to a claim by a music rightsholder.
For sports content, we found very high rates of removals for both live sporting streams and sports highlights, including high rates of Terms of Service violations; both copyright owners and YouTube appear to be highly active in policing this category. For the ‘hacks’ category, we found high rates of removals, but mostly for Terms of Service violations, indicating that YouTube, rather than rightsholders, more commonly takes action to remove content and terminate accounts that provide DRM circumvention information.
Overall, we found that YouTube’s heavily automated content moderation system involves substantial discretionary decision-making. We welcome the capacity for discretion in automated systems, but it matters who is making the decisions and why. As nations pressure platforms to take a more active role in moderating content across the internet, we must continue to advance our methods for holding decision-makers to account.
This experimental methodology has left us optimistic about the potential for researchers to use machine learning classifiers to better understand systems of algorithmic governance in operation at a large scale and over a sustained period.