Structured Data on Commons, Part Two – Federated Wikibase and Multi-Content Revisions

Structured Data on Commons (SDC) relies on two other pieces of software to make it work: federated Wikibase and Multi-Content Revisions (MCR). Both of these things required a lot of time and resources to make them work, with federated Wikibase developed by Wikimedia Deutschland, and MCR development shared by several teams across the Wikimedia Foundation and Wikimedia Deutschland. MCR, as one of the most significant changes to MediaWiki in the past decade, underwent an extensive proposal-and-discussion period before development.

Wikibase is a free, open-source database software extension for MediaWiki developed to as part of the Wikidata project. Federated Wikibase allows a wiki with Wikibase installed (the client) to communicate back to a central Wikibase installation (the repository). In the context of the structured data project, the client, Commons, needs to be able to get and return information from the repository, Wikidata. The information Wikidata holds that Commons needs is the structure and relationship between concepts being described in files Commons. For example, if an image depicts a house cat, the labeling of “house cat” as depicted is stored on Commons, while the concept of “depicts” comes from Wikidata.

Federation means making calls from one wiki’s Wikibase instance to another to retrieve information, and those calls have the potential to end up affecting the performance of the websites. Making sure that structured data didn’t cause such a slow down was one of the challenges with federated Wikibase. Additionally, enabling cross-communication between Wikibase instances required a lot of new code changes that sometimes had surprising side effects on the host wiki. The development teams spent months finding and fixing all of these problems.

Multi-Content Revisions completely changes how pages are assembled and displayed to a user. On a wiki without MCR, a page revision – that is, the version of a page as it exists when edited and saved at a particular state in time – is stored and displayed as a single type of data, such as some form of wikitext mark-up, JSON, or Wikibase entries. Since SDC needs to store and display more than one type of revision at a time, software needed to be written to change how a page revisions work for Commons.

MCR restructures a part of how MediaWiki stores information, by adding a layer of “indirection” as a way to link between revisions with different data components. The additional layer required not only new server-side code and an interface, but a massive change in the data schema and accompanying migration to the new schema.

To the left, the previous method of calling a page revision. To the right, calling a page revision using MCR.

Since the way page revisions and the data contained were stored in the back-end, there was additional work to make this change on the front-end, facing Commons users.

  • Diff views have to work with multiple slots. A diff view details a revision of a page, and engineering had to be done to show diffs from multiple slots of a revision.
  • Multi-slot views were needed. This work parallels the work on diff views, and shows the content of all slots when viewing a revision of a page.
  • An extra slot had to be configured in order to store Wikibase entities on MediaWiki in addition to wiki-text.
  • MediaWiki extensions have to be compatible with MCR, and the tools had to be identified and updated. This ensures, for example, that the tools we use to prevent spam and other malicious behavior work with MCR.
  • MediaWiki’s internal search engine, Cirrus Search, had to be engineered to work with MCR. Cirrus Search can crawl each slot, which will surface the information there to the widest possible audience. The enables semantic search of structured, linked data in files.

All of this engineering for MCR and federated Wikibase had to be completed before the Structured Data team was able to release its features to Commons. The Structured Data team is very grateful to the Core Platform, Wikidata, and Search Platform teams for all their work to make structured data storable, displayable, and searchable on Commons. With the infrastructure they created, the SDC team can create more powerful structured data features for Commons contributors.

Next: Part Three – Multilingual File Captions

Previously: Structured Data on Commons – A Blog Series


Comments? Questions? Discuss this post