Structured Data on Commons, Part Two – Federated Wikibase and Multi-Content Revisions

Structured Data on Commons (SDC) relies on two other pieces of software to make it work: federated Wikibase and Multi-Content Revisions (MCR). Both required significant time and resources to build: federated Wikibase was developed by Wikimedia Deutschland, while MCR development was shared by several teams across the Wikimedia Foundation and Wikimedia Deutschland. MCR, one of the most significant changes to MediaWiki in the past decade, underwent an extensive proposal-and-discussion period before development began.

Wikibase is a free, open-source database software extension for MediaWiki, developed as part of the Wikidata project. Federated Wikibase allows a wiki with Wikibase installed (the client) to communicate with a central Wikibase installation (the repository). In the context of the structured data project, the client, Commons, needs to be able to retrieve information from the repository, Wikidata. The information Wikidata holds that Commons needs is the structure of, and relationships between, the concepts described in files on Commons. For example, if an image depicts a house cat, the statement labeling a house cat as depicted is stored on Commons, while the concept of “depicts” comes from Wikidata.
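As a rough illustration of that split, here is a minimal Python sketch that reads the depicts statements stored on a file’s MediaInfo entity via the Commons API. The ID M123456 is hypothetical; real MediaInfo IDs are “M” followed by the file page’s ID.

```python
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

# Fetch the structured data (MediaInfo entity) for a file on Commons.
# "M123456" is a hypothetical ID used for illustration only.
resp = requests.get(COMMONS_API, params={
    "action": "wbgetentities",
    "ids": "M123456",
    "format": "json",
}).json()

entity = resp["entities"]["M123456"]

# The "depicts" statements live on Commons, but the property (P180)
# and its values (e.g. Q146, "house cat") are concepts defined on Wikidata.
for statement in entity.get("statements", {}).get("P180", []):
    item_id = statement["mainsnak"]["datavalue"]["value"]["id"]
    print("depicts:", item_id)  # e.g. "Q146", resolved against Wikidata
```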

Federation means making calls from one wiki’s Wikibase instance to another to retrieve information, and those calls have the potential to affect the performance of the websites. Making sure that structured data didn’t cause such a slowdown was one of the challenges with federated Wikibase. Additionally, enabling cross-communication between Wikibase instances required many new code changes that sometimes had surprising side effects on the host wiki. The development teams spent months finding and fixing these problems.

Multi-Content Revisions completely changes how pages are assembled and displayed to a user. On a wiki without MCR, a page revision – that is, the version of a page as it exists when edited and saved at a particular point in time – is stored and displayed as a single type of data, such as some form of wikitext markup, JSON, or Wikibase entries. Since SDC needs to store and display more than one type of content in a single revision, software needed to be written to change how page revisions work for Commons.

MCR restructures part of how MediaWiki stores information by adding a layer of “indirection” that links a revision to its different data components, called slots. The additional layer required not only new server-side code and an interface, but a massive change in the data schema and an accompanying migration to the new schema.

Figure: to the left, the previous method of calling a page revision; to the right, calling a page revision using MCR.
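To see the slot model in action, here is a minimal sketch that lists the content slots of a file page’s latest revision using the standard MediaWiki query API; the file title is just an example.

```python
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

# With MCR, one revision can carry several content "slots". Asking for
# rvslots=* returns each slot's content model separately.
resp = requests.get(COMMONS_API, params={
    "action": "query",
    "prop": "revisions",
    "titles": "File:Example.jpg",  # any file page
    "rvprop": "content",
    "rvslots": "*",
    "format": "json",
    "formatversion": "2",
}).json()

revision = resp["query"]["pages"][0]["revisions"][0]
for slot_name, slot in revision["slots"].items():
    # On Commons a file page typically has a "main" (wikitext) slot
    # and a "mediainfo" (Wikibase entity) slot.
    print(slot_name, slot["contentmodel"])
```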

Since the way page revisions and the data they contain are stored changed in the back-end, additional work was needed to carry this change through to the front-end, facing Commons users:

  • Diff views had to work with multiple slots. A diff view details the changes in a revision of a page, and engineering was needed to show diffs from multiple slots of a revision.
  • Multi-slot views were needed. This work parallels the work on diff views, and shows the content of all slots when viewing a revision of a page.
  • An extra slot had to be configured in order to store Wikibase entities on MediaWiki in addition to wikitext.
  • MediaWiki extensions had to be made compatible with MCR, and the affected tools had to be identified and updated. This ensures, for example, that the tools we use to prevent spam and other malicious behavior work with MCR.
  • MediaWiki’s internal search engine, CirrusSearch, had to be engineered to work with MCR. CirrusSearch can crawl each slot, surfacing the information there to the widest possible audience. This enables semantic search of structured, linked data in files (see the sketch after this list).
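As a hedged sketch of what that enables, the query below uses the haswbstatement search keyword to find files by structured statement rather than by words in a description. The statement P180=Q146 (“depicts: house cat”) is illustrative; any indexed property=value pair works the same way.

```python
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

# Search for files whose structured data contains a given statement.
resp = requests.get(COMMONS_API, params={
    "action": "query",
    "list": "search",
    "srsearch": "haswbstatement:P180=Q146",  # depicts: house cat
    "srnamespace": "6",                      # File: namespace
    "format": "json",
}).json()

for result in resp["query"]["search"]:
    print(result["title"])
```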

All of this engineering for MCR and federated Wikibase had to be completed before the Structured Data team was able to release its features to Commons. The Structured Data team is very grateful to the Core Platform, Wikidata, and Search Platform teams for all their work to make structured data storable, displayable, and searchable on Commons. With the infrastructure they created, the SDC team can create more powerful structured data features for Commons contributors.

Next: Part Three – Multilingual File Captions

Previously: Structured Data on Commons – A Blog Series


Structured Data on Commons – A Blog Series

File:One_hell_of_a_mess….jpg by Tom Cronin, CC-BY-SA-2.0. It’s a picture of a big analog synthesiser…

Wikimedia Commons is the freely-licensed media repository hosted by the Wikimedia Foundation. Started in 2004, Commons contains over 50 million files—all of which are meant to contain educational value and help illustrate other Wikimedia projects such as Wikipedia. As with all Wikimedia projects, the content is created, curated, and edited by volunteers. In addition to the content work on the wikis, the Commons community participates in organizing and running thematic media contribution campaigns around the world such as Wiki Loves Monuments, Wiki Loves Food, and Wiki Loves Africa.

Structured Data on Commons (SDC) is a three-year software development project funded by the Sloan Foundation to provide the infrastructure for Wikimedia Commons volunteers to organize data about media files in a consistent, linked manner. The goals of the project are to make contributing to Commons easier by providing new ways to edit, curate, and write software for Commons, and to make general use of Commons easier by expanding capabilities in search and reuse. These goals will be served by improved support for multilingual content and ways of working on Commons. This is the first in a blog series that will document the different parts of implementing SDC, starting with this introduction to the project and brief outlines of the software involved in making it happen, each to be covered more in-depth later.

File:Wikidata_for_Commons-logo.svg, PD ineligible.

Part One – An Introduction to the Software

Commons is built on MediaWiki, the same software used by the other Wikimedia projects. MediaWiki was primarily developed to host text. Because of this, information about files on Commons is stored in plain-text descriptions (wikitext, templates) and categories. The information includes at least the uploader, author, source, date, and license of a file, with many other optional items. These pieces of data are usually only available in one language—mostly English—and, most importantly, they are not structured in a way that lets software developers write programs that consistently understand the data stored on file pages. Data that is structured in a consistent, understandable way is called “machine-readable,” and having machine-readable data is a primary goal for the Structured Data on Commons project.
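To make the problem concrete, here is a small sketch, using invented wikitext, of what extracting a single field from a typical file description looks like today. The pattern matching is brittle and breaks as soon as templates are nested, reordered, or localized.

```python
import re

# A fragment of a typical Commons file description in wikitext.
wikitext = """{{Information
|description={{en|1=A house cat sitting on a windowsill}}
|date=2018-05-04
|source={{own}}
|author=[[User:Example|Example]]
}}"""

# A naive attempt to pull out the date field; this fails for nested
# templates, reordered parameters, or translated parameter names.
match = re.search(r"\|date=(.*)", wikitext)
print(match.group(1) if match else "no date found")  # "2018-05-04"
```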

In order to provide this consistent, machine-readable data, the information needs to be stored in a database instead of as plain text in MediaWiki. Wikibase is the software solution for that need. Developed by Wikimedia Deutschland to support Wikidata, Wikibase enables MediaWiki to store structured data or access data that is stored in a structured data repository. The project needed a way to use Wikibase on other wikis and connect the information back to Wikidata, a capability which had recently been developed. Called Federated Wikibase, this software is crucial to organizing media information on Commons.
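As a small sketch of what federation looks like from a client’s point of view, the IDs used in Commons statements can be resolved against the central repository, Wikidata, with a standard Wikibase API call. The IDs below are illustrative.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# Commons stores the statements, while the vocabulary (properties like
# P180 and items like Q146) lives on the central repository, Wikidata.
resp = requests.get(WIKIDATA_API, params={
    "action": "wbgetentities",
    "ids": "P180|Q146",
    "props": "labels",
    "languages": "en",
    "format": "json",
}).json()

for entity_id, entity in resp["entities"].items():
    print(entity_id, "=", entity["labels"]["en"]["value"])
# P180 = depicts
# Q146 = house cat
```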

The next piece of software needed was Multi-Content Revisions (MCR). MCR is a way of putting together a wiki page that needs to pull information from different places, stored in different ways—in other words, MCR can assemble information from both MediaWiki and Wikibase to be displayed and managed together. More information about Federated Wikibase and MCR will be covered in a future post in this series.

Once Federated Wikibase and MCR were ready for release, the Structured Data on Commons team produced the first user-facing feature to use the new underlying software: multilingual file captions. Captions—stored in Wikibase—serve a similar function to the description template used on file pages, which is stored in MediaWiki: both are supposed to say what is in the file. However, descriptions are not limited in length; they may contain extra detail not necessary for finding the file, including wikilinks and external links; and while the template supports adding extra languages, the process is not necessarily easy. Captions offer an easier way to add other languages, are limited in length, and should describe the file only in a short, factual way. This makes files with captions easier to find in search, in a structured, multilingual way, for both humans and software programs alike.
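Under the hood, captions are stored as multilingual labels on a file’s MediaInfo entity, so every language version of a caption can be fetched in one request. A minimal sketch, again using a hypothetical MediaInfo ID:

```python
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

# Fetch all captions (labels) for one file. "M123456" is hypothetical.
resp = requests.get(COMMONS_API, params={
    "action": "wbgetentities",
    "ids": "M123456",
    "props": "labels",
    "format": "json",
}).json()

for lang, label in resp["entities"]["M123456"].get("labels", {}).items():
    print(f"{lang}: {label['value']}")  # e.g. "en: A black house cat"
```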

After releasing Wikibase and MCR to Commons with captions to make sure it all worked, the development team released support for the first structured statement type, “depicts.” Depicts statements make simple, factual claims about the content of a file and link to their matching concept on Wikidata. To further develop depicts statements, support for qualifiers was released as well. Qualifiers allow depicts statements to carry more information about what is being depicted. For example, a picture of a black house cat can have the structured statement depicts: house cat[color:black]. Depicts statements live on a new tab introduced to the file page, “Structured data.” Aside from captions, all structured data is on this tab.
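For readers curious about the underlying data, here is a sketch of the JSON shape of that depicts-with-qualifier statement as it appears in a file’s MediaInfo entity, written as a Python dict. The IDs are illustrative: on Wikidata, P180 is “depicts”, Q146 “house cat”, P462 “color”, and Q23445 “black”.

```python
# depicts: house cat [color: black], in standard Wikibase statement form.
statement = {
    "mainsnak": {
        "snaktype": "value",
        "property": "P180",  # depicts
        "datavalue": {
            "value": {"entity-type": "item", "id": "Q146"},  # house cat
            "type": "wikibase-entityid",
        },
    },
    "type": "statement",
    "rank": "normal",
    "qualifiers": {
        "P462": [  # color
            {
                "snaktype": "value",
                "property": "P462",
                "datavalue": {
                    "value": {"entity-type": "item", "id": "Q23445"},  # black
                    "type": "wikibase-entityid",
                },
            }
        ]
    },
}
```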

File:Depicts_with_a_qualifier.png by User:Keegan (WMF), CC-BY-SA-3.0. An example of a depicts statement with a qualifier.

After this short introduction, the SDC blog series will have further information about depicts and qualifiers, as well as support for making other types of statements about files.

Next: Part Two – Federated Wikibase and Multi-Content Revisions