Big Data & Society: Automated Facts, Data Contextualization and Knowledge Colonialism: A Conversation Between Denny Vrandečić and Heather Ford on Wikipedia’s 20th Anniversary

At the start of 2021, contributors across the world will celebrate the 20-year anniversary of the first Wikipedia and the 250,000 volunteers who support it in over 300 languages. In advance of Wikipedia’s 20th anniversary, Joseph Reagle, co-editor of Wikipedia @ 20: Stories of an Incomplete Revolution (MIT Press, 2020), facilitated a conversation between two of the volume’s essayists on the potential power and pitfalls of Wikipedia’s latest project: Abstract Wikipedia, an effort to automate the sharing of “facts”, information and data across different language versions of the online encyclopedia.

Denny Vrandečić (founder of Wikidata and now Abstract Wikipedia, see FN1) and Heather Ford (scholar of Wikipedia governance) discuss the automation of digitised facts (e.g., Marie Curie’s birth year is 1867) as a means of reducing user effort and possible mistakes. They raise two related concerns: what happens when abstract “facts” are divorced from their context via automation, and how the automated import of decontextualized data from larger projects can overwhelm smaller ones. Both of these concerns relate to the question of how automation influences people’s agency over the information that they produce at a local level. Automation may enhance efficiencies, but this big data practice has important consequences for power and accountability within the world’s most widely used encyclopedia.

------ Context ------

Ford: One of the key dangers with the automated extraction of facts from Wikipedia is that it has resulted in the loss of citation data. On Wikipedia, there is a significant focus on citing statements to reliable sources and this work is seen as central to the principle of verifiability. But on Wikidata, many statements are only cited to a particular Wikipedia language version because of their mass import. This ultimately prevents readers being able to follow up the source of a statement and influences the accountability of Wikimedia. How will Abstract Wikipedia avoid this problem?

Vrandečić: The problem of unsourced statements is not new to Wikidata. Wikidata makes it much more visible, though. As of 2020, more than 65% of statements on Wikidata are sourced – this has gone up from less than 20% five years ago. Wikidata is improving considerably. I am actually convinced that the ratio of referenced statements on Wikidata is by now considerably better than on Wikipedia. Unfortunately, we are often compared with a perfect 100%, and I don’t think that is fair.

On Wikipedia it is much harder to count how many of its statements are actually sourced. But if you take a look at an average article on Wikipedia, you will find that far less than half of the statements in it have references – but we have no automatic metric counting this, so the actual comparison is rather hidden. I would love to see more research in this area.

Abstract Wikipedia will be more similar to Wikidata: it will be far easier to count how many statements have references than on Wikipedia, which will be the same blessing and curse as for Wikidata.

The other advantage is that, due to the way Abstract Wikipedia is built, we will be able to generate sentences for each statement, and not just structured data as in Wikidata. This may in fact allow us to look for references in digital source material more easily. Together with the fact that we can more easily see which statements are missing references, it should be possible to create workflows that will search for possible references explicitly, and allow the community to fill the gaps.

Ford: You are right that the problem of unsourced statements is not new to Wikidata and it is true that it is much easier to figure out how much of Wikidata is unsourced than Wikipedia. But I don’t think that this is a reason to disavow the problem. The problem is significant: it influences Wikipedia’s core principles and affects how Wikipedia’s facts are ultimately understood – not as provisional statements that emerge from particular perspectives, but as hard facts that cannot be questioned. One could also argue that the problem is even more important to solve on Wikidata because of the authority that datafied facts have when they are extracted and presented on third party platforms.

I am also not as hopeful as you are that the problem is being solved over time. Having discussed this with others, I believe that the reason we are seeing improvement in the statistics is that WikiCite’s focus is on creating and maintaining source and citation data rather than annotating current Wikidata items. If you look at Wikidata items about places, Johannesburg, for example, you’ll notice that the majority of statements are not sourced meaningfully beyond a particular Wikipedia language version. A vast amount of manual labour is needed to do this work, and I just don’t see it happening anytime soon without a clear incentive and programme to do so at scale. I agree with you that we need more research in this area to look at the extent to which particular topics (such as places) are well-cited and how this has changed over time, but first it needs to be acknowledged that there is a potential problem that isn’t being dealt with.

I hope that Abstract Wikipedia prioritises and necessitates the addition of sources as new statements are added. If this can happen at the outset, with the small scale approach at the start, and if Abstract Wikipedia prioritises human review of sources, then it will certainly be an improvement on Wikidata.

Vrandečić: I also hope that the Abstract Wikipedia community will put the right emphasis on sourcing its content, but in the end this is a decision to be made by that community, and not by me.

I think what helped with improving the situation on Wikidata so much over the last few years was that once the concern was raised, dashboards and metrics were introduced which made the situation more visible. I think the fact that the coverage of statements with sources more than tripled since then is a testament to the fact that this problem has been recognized and has been worked on by the community, and that the community is in fact very capable of dealing with such issues.

Because of that, I plan to put the same trust in the future community that will work on Abstract Wikipedia. And I welcome researchers like you to critically observe and accompany that development; after all, it was exactly these concerns that led to the setup of such dashboards and to making these numbers more observable.

------ Colonialism ------

Ford: Abstract Wikipedia is intended to help populate smaller Wikipedias. How will you prevent the English-speaking community and other dominant languages from drowning out local content? For example, the encyclopedia in the Cebuano language, spoken in the southern Philippines, is one of the largest because it has been flooded by thousands of automatically translated articles, which is difficult for the local Cebuano community to curate and maintain.

Vrandečić: As far as I can tell, the problem with the Cebuano Wikipedia is not the availability of content to readers of Cebuano – but that the Cebuano community has no control over this content, how much of it the community wants to integrate, how to filter the machine-generated content out, and how to change it – in short, how to take ownership of the content. These articles are not really owned and maintained by the Cebuano community, but by a single person, who, in dialog with the community, runs his bot to write these millions of articles.

But there is no proper path for the wider community to take ownership of this large set of articles. This is one thing Abstract Wikipedia aims to change: the content and the integration in the local communities will be entirely controlled by the local community. Not only whether they want to pull in Abstract Wikipedia wholesale or not, but in much more detail. And the community members will always be able to go to Abstract Wikipedia and work on the content itself. They are not just passive recipients of content, but they can make their voices heard and join in the creation and maintenance of the content across languages.

I agree with the sentiment of your question, and I do have a similar worry: it is imperative that Abstract Wikipedia does not become another tool for colonizing the knowledge sphere even more thoroughly. Even with the Cebuano community members contributing to Abstract Wikipedia, they will likely be outnumbered by community members from Europe and the US. What are your ideas that, if implemented, could help with Abstract Wikipedia becoming a true tool towards knowledge equity?

Ford: You’re right that the problem with Cebuano Wikipedia is not the availability of content but the control over that content. What is interesting is that it was a Cebuano Wikipedian who built the first bot that “dump(ed) articles on all the communes of France (over 50,000) on Cebuano Wikipedia,” according to long-time Filipino Wikipedian Josh Lim.

Lim wrote that the result wasn’t only that “local content was drowned out,” but that it sparked “a numbers war among local Wikipedias” when “other Philippine-language Wikipedias tried to outdo one another in achieving high article counts.” The move to automate articles brought with it a set of values that prioritised quantity over quality, and prioritised subjects that could easily be automatically added (communes of France, for example) over locally relevant topics. It affected the ability of the small Cebuano community to grow its content organically and caused embarrassment when Filipino Wikipedians met other Wikipedians abroad as Cebuano Wikipedia was nicknamed “the Wikipedia of French communes.”

The answer, I think, is for us to think more carefully about what we mean when we say that the “community” will be able to “control” their Abstract Wikipedia project. What defines the Cebuano Abstract Wikipedia community? And how will they make decisions about the content in their repository? The problem with Cebuano Wikipedia was that concerned community members only realised the scale of the problem after the first bot had done its work and the editor running the bot had left the project. There was no consultation. And after the precedent had been set, editors consented to further bot work because they were in a race to increase the visibility of their Wikipedia project. The problem wasn’t only about the domination of Western knowledge. It was about the domination of a logic that recognises the number of articles as a heuristic for quality and/or as a way that small projects gain visibility on the project.

Is there a way that we can, at the outset of a project, think about progress beyond quantitative heuristics towards qualitative ones? This could mean focusing not on the numbers of articles created but on the development of local principles and policies, for example.

Can we start small and focus on inclusive deliberation with language communities beyond Wikimedia? This could mean working with only a few languages during an evaluation period in which there is a focus on changing the scope and venue for deliberation – beyond discussions with the few who have the interest, knowledge, and time to give their opinion on wiki, to equipping representatives from the language community with the knowledge necessary for them to make an informed, collective decision beyond the wiki.

Can we provide communities with the real ability to reject the project and at different stages, using the same principles of deliberation? This won’t be as easy as how most projects are launched and conducted, but it will reap returns because it will foreground planning and discussion and lead to a decision that can be ethically justified.

Finally, can we invite social scientists working with interested Wikimedians to study the project as it evolves, suggesting important research questions that you highlight here so that we can evaluate the extent to which Abstract Wikipedia fulfills the goals it seeks over time? A number of principles and evaluation frameworks are emerging to investigate the ethical impact of automated systems. There are some very promising evaluation frameworks and principles emerging from this research and it makes sense that Abstract Wikipedia engage with them.

Vrandečić: I would love for us to move from a metric of number of articles to a more meaningful metric, such as the number of contributors. I think this would be a much more meaningful metric reflecting the health of our projects.

I am wary of setting up deliberative bodies that stretch beyond Wikimedia to make decisions about the Wikimedia projects. One of our strengths is the autonomy of the projects. Setting up a body that is above the actual productive community working on the wiki and possibly dictating which kind of content should and should not be written sounds incompatible with the way our projects have worked so far. Inviting new people into the Wikimedia projects, yes, absolutely, but I think that in the end what happens in a Wikipedia language edition should be decided by the active community members of that given Wikipedia.

What I can commit to is to set up a forum to explicitly discuss the ethics and the implications of the Abstract Wikipedia project, and to send out a wider invitation beyond the current Wikimedia communities. This should allow us to identify potential problems and to either design solutions or avoid parts of the project altogether. One result would indeed be to make sure that the local communities retain control of the whole process of integrating content from Abstract Wikipedia, not just nominally but also practically and deliberately. I would be honored if you’d join us in this forum.

Ford: I get what you’re saying about the problems of non-Wikimedia editors making decisions that Wikimedians have to implement. But I believe that it is a mistake to limit deliberation about the future of Wikimedia projects to the wiki. It’s a mistake because what is presented as fact on Wikimedia isn’t just another representation acting as an equal player in a sea of alternative representations. Wikipedia and the Wikimedia projects that serve it are a dominant representation that affects people well beyond the wiki and around the world because they are increasingly recognised as a global consensus view about people, places, events and things around the world. That means that there are stakeholders well beyond those represented by a small Wikipedia community who should be a part of those discussions because it is those communities who will be affected by those representations. Expanding those discussions doesn’t necessarily mean that the result will be decisions that Wikimedians can’t implement. Recent innovations in deliberation design have demonstrated how this is, indeed, possible to achieve. I am really pleased that you will set up a forum to discuss the ethics and implications of the Abstract Wikipedia project and I look forward to finding ways to expand the deliberative scope of that conversation. Thank you for your invitation!

-----

FN1. Abstract Wikipedia is a provisional name for this new project and the selection of a final name is still underway.

Denny Vrandečić is a computer scientist who founded Wikidata in 2012 so that information could be shared across projects and languages. This year, Vrandečić followed up with Abstract Wikipedia for more complex expressions (note Abstract Wikipedia is a provisional name for this new project and the selection of a final name is still underway). Consider the statement: “Marie Curie is the only person who received two Nobel Prizes in two different sciences.” In his contribution to the book, he describes how such a claim might be represented in an abstract and computer-understandable way and then automatically translated across languages.

Heather Ford is a scholar of Wikipedia and Internet governance and author of the chapter Rise of the Underdog. In her forthcoming book about the Wikipedia “fact factory,” Ford traces the development of the 2011 Egyptian Revolution article to show how factual claims are conceived, contested, and constructed at Wikipedia before being used by the likes of Google and Bing. It is a story about how knowledge automation offers great efficiencies while challenging some of the ideals of the Internet as global public infrastructure.

For further reading on linked data and automation in the context of Wikipedia and Google, see Heather Ford’s 2016 work in collaboration with Mark Graham from the Oxford Internet Institute in a chapter for the book Code and the City edited by Kitchin and Perng called “Semantic Cities: Coded Geopolitics and the Rise of the Semantic Web” (pre-print), and a journal article for Environment and Planning D “Provenance, Power and Place: Linked Data and Opaque Digital Geographies” (pre-print).

Thursday, 10 December 2020
