As the three-year grant period for building Structured Data on Commons (SDC) comes to a close with the end of 2019, I’d like to share some lists of the past two years’ worth of planning, discussion, building, testing, and releases the team has done with the Commons community.
With depicts statements available to make the most basic claims about files on Commons, it was time to make more fully-formed statements. The Structured Data on Commons development team developed and released the first level of support for types of statements other than depicts.
“Other statements” offer expanded data about a file, using Wikidata properties such as creator (P170), location (P276), Commons quality assessment (P6731), license (P275), and more. For an example of depicts plus other statements, here’s a file that is an image of sugar cubes:
This is the representation of the file in structured data, using depicts with qualifiers in combination with other statements:
This information is “machine-readable,” meaning that people can write software to interact with it. Soon there will also be the power to query the data, along with a host of other potential uses. Lucas Werkmeister wrote a separate blog post covering some of the possibilities of Structured Data on Commons. Importantly, all of this information is multilingual as well, whereas previously most data was restricted to English when used in templates and categories.
With depicts and other statements taken as a whole, contributors to Wikimedia Commons can now begin to fully contribute structured data. The development team continues to work on support for types of data beyond words, such as geocoordinates, time stamps, and other such types. Additional support for community tools such as Lua functionality is making progress as well. After this multi-year effort, the partners involved in the project can start the work of building a more accessible Commons at last.
Previously: Part Four – Depicts Statements
Now that the underlying software for Structured Data on Commons has been put in place, along with Captions helping to demonstrate the software worked, the development team was ready to release the first form of structured statements for Commons: depicts.
Depicts is a statement for representing the concepts or topics present or expressed in a media file. The depicts statement can be considered the most basic example for modeling information about a file.
With support for depicts, people searching for specific media files on Commons can begin finding them in a structured, multilingual way. At the time of release, depicts statements can be searched using the keyword haswbstatement. For example, if you wanted to find all instances of depicts (P180) a house cat (Q146), in the search bar you can use haswbstatement:P180=Q146 and it will return results in all languages.
After making sure basic depicts support was working, the development team added support for qualifiers. By using qualifiers for depicts, users are able to represent the file even further by refining, contextualizing, or expanding the simple statement. For example, the previous statement of depicts (P180) a house cat (Q146) can be refined to depicts (P180) a house cat (Q146) [color: gray (Q42519)], and a search for the refined statement will return only files with statements that match a gray cat. As with basic depicts, this functionality is multilingual and works in whatever languages are available.
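The search syntax above can also be assembled by a program. The following is a minimal sketch rather than an official client: it only builds search terms and a URL for the standard MediaWiki search API (action=query, list=search) and makes no network request. The helper names are hypothetical, and the square-bracket qualifier syntax and the use of color (P462) as the qualifier property are assumptions that the text above does not confirm.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def haswbstatement(prop, value, qualifier_prop=None, qualifier_value=None):
    """Build a search term such as 'haswbstatement:P180=Q146'."""
    term = f"haswbstatement:{prop}={value}"
    if qualifier_prop and qualifier_value:
        # Assumption: a qualifier is written in square brackets after the value.
        term += f"[{qualifier_prop}={qualifier_value}]"
    return term


def search_url(term, limit=10):
    """Build (but do not send) a MediaWiki search API request for file pages."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srnamespace": 6,  # namespace 6 is the File namespace
        "srlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


# depicts (P180) a house cat (Q146)
basic = haswbstatement("P180", "Q146")
# The same statement refined with a color qualifier (P462 is an assumed property ID).
gray_cat = haswbstatement("P180", "Q146", "P462", "Q42519")

print(basic)
print(gray_cat)
print(search_url(basic))
```

Keeping the term builder separate from the URL builder means the same term can be pasted into the Commons search bar or sent through the API.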
Now that Commons has the most basic modeling for data in a file in place, the development team turned to supporting other types of statements beyond depicts. These other types of statements will be covered in the next part.
Previously: Part Three – Multilingual File Captions
Wikimedia Commons holds over fifty million freely-licensed media files. These millions of images, sounds, video, documents, three-dimensional files and more contain a vast amount of information related to the contents of the file and the context for the world around them. As Commons has collected files over the years, the volunteers who curate and maintain the site have developed a system to contain and present this information to the world, using MediaWiki, wikitext, and templates.
A description template is the first and primary way information about a file is shown to users. These templates can be a powerful tool for displaying information about files; descriptions provide meaningful context and information about the work presented. Descriptions can be as long as the user would like, providing wikitext markup and links for others to find out more. Description templates can also hold translations by adding language fields. However, the Structured Data team saw some areas where a feature like captions could improve upon description templates.
Multilingual captions help share the burden of descriptions by providing a space to describe a file in a way that is standard across all files, easy to translate, and easy to use. Captions do not support wikitext, so no knowledge of how links work is needed in this space; links can still be provided in the more expansive file description. Captions are added during the upload process using the UploadWizard, or they can be added directly on any file page on Commons. The translation feature for captions is a simple interface that requires only a few steps to create and share a caption translation.
The “multilingual” in “multilingual captions” highlights a primary focus of Structured Data features: opening up access to Commons to as many languages as possible beyond its present capabilities. This is enormously beneficial to the Wikimedia movement and the Wikimedia Foundation’s mission of sharing knowledge with the world. In addition to captions, planned future features will support adding “statements” from Wikidata to files, effectively describing them in an organized way that can be accessed by programs and bots to present media. These statements can be multilingual as Wikidata supports translations, which will make statements searchable in any language that has a translation provided.
Structured Data on Commons (SDC) relies on two other pieces of software to make it work: federated Wikibase and Multi-Content Revisions (MCR). Both of these things required a lot of time and resources to make them work, with federated Wikibase developed by Wikimedia Deutschland, and MCR development shared by several teams across the Wikimedia Foundation and Wikimedia Deutschland. MCR, as one of the most significant changes to MediaWiki in the past decade, underwent an extensive proposal-and-discussion period before development.
Wikibase is a free, open-source database software extension for MediaWiki developed as part of the Wikidata project. Federated Wikibase allows a wiki with Wikibase installed (the client) to communicate back to a central Wikibase installation (the repository). In the context of the structured data project, the client, Commons, needs to be able to get and return information from the repository, Wikidata. The information Wikidata holds that Commons needs is the structure and relationship between concepts being described in files on Commons. For example, if an image depicts a house cat, the labeling of “house cat” as depicted is stored on Commons, while the concept of “depicts” comes from Wikidata.
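The division of labor between client and repository can be illustrated with a small in-memory sketch. It is a toy model under stated assumptions, not how Wikibase is implemented: the class and function names are hypothetical, the repository is a plain dictionary rather than a remote wiki, and the labels are sample data included purely for illustration.

```python
from dataclasses import dataclass

# Stand-in for the repository (Wikidata): labels keyed by entity ID and language.
# Sample data for illustration only; a real repository is a separate wiki.
REPOSITORY_LABELS = {
    "P180": {"en": "depicts", "de": "Motiv"},
    "Q146": {"en": "house cat", "de": "Hauskatze"},
}


@dataclass(frozen=True)
class Statement:
    """What the client (Commons) stores for a file: entity IDs only."""
    property_id: str
    value_id: str


def label(entity_id, lang, fallback="en"):
    """Look up a label in the repository; fall back to another language, then the raw ID."""
    labels = REPOSITORY_LABELS.get(entity_id, {})
    return labels.get(lang) or labels.get(fallback) or entity_id


def render(statement, lang):
    """Display a stored statement using labels fetched from the repository."""
    return f"{label(statement.property_id, lang)}: {label(statement.value_id, lang)}"


stmt = Statement("P180", "Q146")
print(render(stmt, "en"))  # depicts: house cat
print(render(stmt, "de"))  # Motiv: Hauskatze
print(render(stmt, "fr"))  # no French sample label, so falls back to English
```

Because the client stores only identifiers, adding a translation in the repository makes every file that uses the statement readable in that language without touching the files themselves.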
Federation means making calls from one wiki’s Wikibase instance to another to retrieve information, and those calls have the potential to end up affecting the performance of the websites. Making sure that structured data didn’t cause such a slowdown was one of the challenges with federated Wikibase. Additionally, enabling cross-communication between Wikibase instances required a lot of new code changes that sometimes had surprising side effects on the host wiki. The development teams spent months finding and fixing all of these problems.
Multi-Content Revisions completely changes how pages are assembled and displayed to a user. On a wiki without MCR, a page revision (that is, the version of a page as it exists when edited and saved at a particular point in time) is stored and displayed as a single type of data, such as some form of wikitext mark-up, JSON, or Wikibase entries. Since SDC needs to store and display more than one type of data in a revision at a time, software needed to be written to change how page revisions work for Commons.
MCR restructures a part of how MediaWiki stores information, by adding a layer of “indirection” as a way to link between revisions with different data components. The additional layer required not only new server-side code and an interface, but a massive change in the data schema and accompanying migration to the new schema.
Since the way page revisions and the data they contain are stored changed in the back-end, there was additional work to carry this change through to the front-end, facing Commons users.
- Diff views have to work with multiple slots. A diff view shows what changed between two revisions of a page, and engineering had to be done to show diffs from multiple slots of a revision.
- Multi-slot views were needed. This work parallels the work on diff views, and shows the content of all slots when viewing a revision of a page.
- An extra slot had to be configured in order to store Wikibase entities on MediaWiki in addition to wikitext.
- MediaWiki extensions have to be compatible with MCR, and the tools had to be identified and updated. This ensures, for example, that the tools we use to prevent spam and other malicious behavior work with MCR.
- MediaWiki’s internal search engine, Cirrus Search, had to be engineered to work with MCR. Cirrus Search can crawl each slot, which will surface the information there to the widest possible audience. This enables semantic search of structured, linked data in files.
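The slot model and the per-slot diffs described above can be sketched in a few lines. This is a simplified illustration, not MediaWiki’s actual storage or diff code: the Revision class and diff_revisions helper are hypothetical, the slot names and sample contents are assumptions, and the standard-library difflib module stands in for MediaWiki’s diff engine.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Revision:
    """A page revision holding one piece of content per named slot."""
    rev_id: int
    slots: dict = field(default_factory=dict)  # slot name -> content as text


def diff_revisions(old, new):
    """Return a unified diff for every slot that changed between two revisions."""
    diffs = {}
    for name in sorted(set(old.slots) | set(new.slots)):
        before = old.slots.get(name, "").splitlines()
        after = new.slots.get(name, "").splitlines()
        lines = list(
            difflib.unified_diff(
                before,
                after,
                fromfile=f"{name}@{old.rev_id}",
                tofile=f"{name}@{new.rev_id}",
                lineterm="",
            )
        )
        if lines:  # unchanged slots produce no diff lines
            diffs[name] = lines
    return diffs


# Assumed slot names: "main" for wikitext, "mediainfo" for structured data.
rev1 = Revision(1, {"main": "== Summary ==\nA gray cat.",
                    "mediainfo": '{"statements": {}}'})
rev2 = Revision(2, {"main": "== Summary ==\nA gray cat.",
                    "mediainfo": '{"statements": {"P180": ["Q146"]}}'})

for slot, lines in diff_revisions(rev1, rev2).items():
    print(f"slot: {slot}")
    print("\n".join(lines))
```

In this example only the structured data slot changed, so the wikitext slot does not appear in the diff at all.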
All of this engineering for MCR and federated Wikibase had to be completed before the Structured Data team was able to release its features to Commons. The Structured Data team is very grateful to the Core Platform, Wikidata, and Search Platform teams for all their work to make structured data storable, displayable, and searchable on Commons. With the infrastructure they created, the SDC team can create more powerful structured data features for Commons contributors.
Previously: Structured Data on Commons – A Blog Series
Wikimedia Commons is the freely-licensed media repository hosted by the Wikimedia Foundation. Started in 2004, Commons contains over 50 million files—all of which are meant to contain educational value and help illustrate other Wikimedia projects such as Wikipedia. As with all Wikimedia projects, the content is created, curated, and edited by volunteers. In addition to the content work on the wikis, the Commons community participates in organizing and running thematic media contribution campaigns around the world such as Wiki Loves Monuments, Wiki Loves Food, and Wiki Loves Africa.
Structured Data on Commons (SDC) is a three-year software development project funded by the Sloan Foundation to provide the infrastructure for Wikimedia Commons volunteers to organize data about media files in a consistent, linked manner. The goals of the project are to make contributing to Commons easier by providing new ways to edit, curate, and write software for Commons, and to make general use of Commons easier by expanding capabilities in search and reuse. These goals will be served by improved support for multilingual content and ways of working on Commons. This is the first in a blog series that will document the different parts of implementing SDC, starting with this introduction to the project and brief outlines of the software involved in making it happen, each to be covered more in-depth later.
Part One – an introduction to the software
Commons is built on MediaWiki, the same software used by the other Wikimedia projects. MediaWiki was primarily developed to host text. Because of this, information about files on Commons is stored in plain-text descriptions (wikitext, templates) and categories. The information includes at least the uploader, author, source, date, and license of a file, with many other optional items. These pieces of data are usually only available in one language—mostly English—and, most importantly, not structured in a way that software developers can consistently write programs to understand the data that is stored in file pages. Data that is structured in a consistent, understandable way is called “machine-readable,” and having machine-readable data is a primary goal for the Structured Data on Commons project.
In order to provide this consistent, machine-readable data, the information needs to be stored in a database instead of plain-text in MediaWiki. Wikibase is the software solution for that need. Wikibase is the software that enables MediaWiki to store structured data or access data that is stored in a structured data repository, developed by Wikimedia Deutschland to support Wikidata. The project needed a way to use Wikibase on other wikis and connect the information back to Wikidata, a feature which had recently been developed. Called Federated Wikibase, this software is crucial to organizing media information on Commons.
The next piece of software needed was Multi-Content Revisions (MCR). MCR is a way of putting a wiki page together that needs to pull information from different places with different ways of storage—in other words, MCR can assemble information from both MediaWiki and Wikibase to be displayed and managed together. More information about Federated Wikibase and MCR will be covered in a future post in this series.
Once Federated Wikibase and MCR were ready for release, the Structured Data on Commons team produced the first user-facing feature to use the new underlying software: multilingual file captions. Captions—stored in Wikibase—have a similar function to the description template used on file pages, which is stored in MediaWiki; they both are supposed to say what is in the file. However, descriptions are not limited in length, they may contain extra detail not necessary to finding the file including wikilinks and external links, and while the template supports adding extra languages, the process is not necessarily easy. Captions support an easier way to add other languages and captions are limited in length and should describe the file only in a short, factual way. This makes files with captions easier to find in search in a structured, multilingual way for both humans and software programs alike.
After releasing Wikibase and MCR to Commons with captions to make sure it all worked, the development team put out support for the first structured statement type, “depicts.” Depicts statements make simple, factual claims about the content of a file and link to their matching concept on Wikidata. To further develop depicts statements, support for qualifiers was released as well. Qualifiers allow depicts statements to have more information about what is being depicted. So for example, a picture of a black house cat can have the structured statement depicts: house cat[color:black]. Depicts statements are on a new tab that was introduced to the file page, “Structured data.” Aside from captions, all structured data is on this tab.
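The bracket notation used above suggests a simple shape for a statement: a value plus an optional set of qualifiers. As a rough sketch only, that shape can be modeled and printed as follows. The DepictsStatement class is hypothetical, and it works with human-readable labels rather than the Wikidata identifiers the real system stores.

```python
from dataclasses import dataclass, field


@dataclass
class DepictsStatement:
    """A depicts claim about a file, with optional qualifiers refining it."""
    value: str
    qualifiers: dict = field(default_factory=dict)  # qualifier name -> value

    def __str__(self):
        text = f"depicts: {self.value}"
        if self.qualifiers:
            # Assumption: several qualifiers are separated by commas.
            inner = ", ".join(f"{k}:{v}" for k, v in self.qualifiers.items())
            text += f"[{inner}]"
        return text


plain = DepictsStatement("house cat")
black_cat = DepictsStatement("house cat", {"color": "black"})

print(plain)      # depicts: house cat
print(black_cat)  # depicts: house cat[color:black]
```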
After this short introduction, the SDC blog series will have further information about depicts and qualifiers, as well as support for making other types of statements about files.