Open Government Data Maturity Model

We live in a resource-constrained world where it is important to consider not just what we want, i.e. the principles of open government data, but also what we are willing to give up to get it. This chart provides a road map for prioritizing among the many aspects of open government data.

Going down the rows on the left side are the different technological strategies of open government data. Across the columns along the top are the different sorts of public information governments produce. The rows and the columns in this chart have an order. Some open government data projects should take precedence over others. Rows above should come before rows below. Columns to the left should come before columns to the right. (At least, roughly.) Don’t run before you can walk.

Columns (types of public information): Law, Services, Operations, Public Data

Rows (technological strategies):

FOI
Freedom of Information laws create a legal right to government documents and a presumption of openness.

Online & Free
Information is not meaningfully public if it is not available on the Internet and free.
Examples, across the columns from Law to Public Data: Put the law online. Use the Internet for service delivery. Post meeting notices online. Make other data available.

Open
Meets 7 principles: Complete. Primary. Timely. Accessible. Non-discriminatory. Non-proprietary. License-free.
Examples, across the columns from Law to Public Data: Law cannot be monopolized through copyright. Post information for everyone to register to vote. Create a gov’t org chart, post decisions in a timely manner. The public’s data should not require proprietary tools.

Structured
Analyzable. Processable. Data is structured to allow automated processing.
Examples, across the columns from Law to Public Data: Syntactic markup for law. Metadata for legislation. Digital signatures. Turn registration requirements into XML. Database of MPs, voting records in XML, upcoming meetings in RSS. Data files for public health surveys, IRS 990s, other collateral data.

Global IDs
URLs & URIs. Place documents at permanent URLs and assign globally unique identifiers in the form of URIs to...
Examples, across the columns from Law to Public Data: Titles, chapters, sections, paragraphs. Government services and processes. MPs, government agencies, and contractors. Geographic locations, transit stops.

APIs
Provides random or write access. Create web/REST-based dynamic access points for large datasets that answer real-world questions.
Examples, across the columns from Law to Public Data: What is 17 USC 1201? What law and legislation affects me? This is “Government as a Platform.” Let third-parties intermediate. What government spending is near me? Public access to records management systems.

Linked Data
Lives on the semantic web. The semantic web turns the web into an interconnected database, revealing new insights that cross-cut information silos.
Examples, across the columns from Law to Public Data: Semantic markup of what law means. Re-use API protocols. Link spending data to corporate ownership records. Re-use existing schemas.

The rows and columns of the maturity model are described in more detail below.

Law

The leftmost column is “Law,” and here the maturity model asserts that access to the law is the most important of the many purposes open government data serves.

A moral imperative to promulgate the law in all of the ways that increase access stems from the principle that ignorance of the law is never a defense. That principle becomes quite a conundrum when the law is hard to find, difficult to understand, and, at times, illegal to share. Examples include the text of the United States Code, judicial dockets, and perhaps information on potential laws (i.e., bills and regulatory proposals).

This moral imperative is only a starting point. Access to law has wider implications, as Carl Malamud writes on law.resource.org: improved civics and law education in schools, deeper research in universities, innovation in the legal information market, savings to the government, reduced costs of legal compliance for small business, and greater access to justice. Free public access to legal materials isn’t necessarily intended to replace the expensive subscription services for legal professionals, but rather to open up legal materials to a new audience.

Services

Services are data produced in the furtherance of a government program. Weather data is an example: the National Weather Service is, or at least was at one time, the largest producer of public data in the government. The Census Bureau was one of the first agencies to put data on the web. This sort of data is produced and distributed as part of the agencies' core missions.

For service-related data there is no moral imperative to make the data available, but there is a legal imperative to further a public policy goal. If an agency’s mission is to produce information, publishing that information as open data can help it further its mission and achieve the goals that we, as a society, have prioritized for our government.

Operations

The next column is “Operations.” This sort of data is information about how government works, how it is being run, and how money is being spent. This is where government accountability looks for corruption, for instance, and it's where we find out who represents us in government, who is making decisions, how representatives voted, and so on.

Only an educated public can hold its government accountable. This idea is rooted in the United States’ Bill of Rights; it is a common understanding of the role of journalism in society; and, even if none of that were true, it is a moral underpinning of democratic government.

Public Data

Last is the catch-all column “Public Data.” This includes, for instance, certain Medicare and Medicaid claim statistics, or geographic data about the location of every single road in the country. This is data for which there is neither a moral imperative to make it public, at least not the sort of moral imperative that exists for law data, nor a legal imperative to proactively make it available.

In a resource-limited world, this sort of data is not a high priority for open data. But making the data open, structured, and so on produces value to society. It is civic capital. Entrepreneurs can build businesses around this data. (Think Google Maps and its predecessors, built originally off of government data and government GPS signals.)

Freedom of Information

The legal right of FOI creates a presumption of openness, but, as you know if you’re familiar with FOIA in the United States, the right is not proactive but reactive. If there’s data you want, and you can figure out which agency has it, you can petition for that information. And if you’re lucky, the agency won’t object and claim one of the exemptions; if you’re lucky, the agency won’t make you pay much to have the data retrieved and copied; and if you’re lucky, you’ll get it in about a year.

Almost 50 years after FOIA was enacted, it’s pretty obvious we can do a lot better. The rows below this one build technology on top of the principle of freedom of information.

Online and Free

This principle says that while FOI provides a mechanism for making information public, information is not meaningfully public until it can be found on the Internet free of charge (except for the cost of Internet access).

Uploading data and documents, in whatever form they already exist, is the first technological step in the maturity model.

Adapted from Sunlight Foundation’s Principles for Transparency in Government.

Open

Eight core principles, plus six other principles, determine whether government data can be considered “open”. The “Open” row in the maturity model refers to seven of the core principles of open government data: it is complete, primary, timely, accessible, non-discriminatory, in non-proprietary formats, and license-free. (This maturity model leaves machine processability for the next row.)

This row, as with FOI, is primarily a matter of policy. In this case, technology policy choices like timeliness and license restrictions affect the usefulness of data, especially whether the data can be used meaningfully to keep government accountable or to effect policy change.

See the 8 Principles of Open Government Data, written by the Open Government Working Group convened by Carl Malamud in December 2007, and the Open Knowledge Foundation’s Open Knowledge Definition (OKD) at opendefinition.org. Also see my book.

Structured

This is the first row that is purely technical, and it refers to creating data in such a way as to make it searchable, sortable, transformable, or, to put it generally, analyzable and machine-processable.

Use spreadsheets instead of PDFs, use text instead of scanned images. Use XML. Break down fields into processable components. Applying structure to data requires an up-front technical investment but pays off by making the data more valuable. In this row, the open data that is published online is the original spreadsheet, an SQL database dump, or bulk XML data.
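To make "structured" concrete, here is a small hypothetical sketch in Python (the element names and fields are invented for illustration, not any agency's actual schema): a legislative record represented as XML with each field broken out for processing rather than buried in a PDF.

```python
# Hypothetical sketch: the same legislative record as machine-processable XML
# rather than free text locked inside a PDF or a scanned image.
import xml.etree.ElementTree as ET

def bill_to_xml(number, title, introduced, sponsor):
    """Build an XML element with each field broken out into its own component."""
    bill = ET.Element("bill", number=number)
    ET.SubElement(bill, "title").text = title
    ET.SubElement(bill, "introduced").text = introduced  # ISO 8601 date string
    ET.SubElement(bill, "sponsor").text = sponsor
    return bill

record = bill_to_xml("H.R. 1234", "A bill to ...", "2012-03-01", "Rep. Jane Doe")
# Each field is now individually searchable, sortable, and transformable.
print(ET.tostring(record, encoding="unicode"))
```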

For more, see my book and the machine-processable principle of the 8 Principles of Open Government Data.

Global IDs

“Global IDs” means globally unique identifiers, a type of structure that can be added to data. The concept is that any document, resource, data record, or entity mentioned in a database, or, some might say, every paragraph in a document, should have a unique identifier that others can use to point to it or cite it elsewhere.

There are many advantages of globally unique identifiers. Identifiers make information findable. For instance, a citation to a paragraph in the law (such as “22 U.S.C. 3301(b)(6)”) is a sort of identifier. The identifier uniquely pinpoints a paragraph in the United States Code.

When identifiers are shared across data silos, they create connections and make the data more adaptable. This is especially important for government spending data, where contract awardees might also be campaign contributors. A shared numeric identifier for each corporation facilitates a connection between these two typically separate databases. The value of the two connected datasets becomes more than just the sum of their parts. When identifiers persist across database versions, users of the database can process the changes from version to version more easily, making connections across time.
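To make the value of a shared identifier concrete, here is a small hypothetical sketch in Python (the datasets, field names, and ID scheme are invented for illustration): two otherwise separate datasets can be joined only because both record the same corporate identifier.

```python
# Hypothetical sketch: two data silos that happen to share a corporate identifier.
contracts = [
    {"corp_id": "US-CORP-0042", "award_usd": 1_500_000},
    {"corp_id": "US-CORP-0099", "award_usd": 250_000},
]
contributions = [
    {"corp_id": "US-CORP-0042", "contribution_usd": 10_000},
]

# Join on the shared identifier: which contract awardees also made contributions?
contributor_ids = {row["corp_id"] for row in contributions}
for contract in contracts:
    if contract["corp_id"] in contributor_ids:
        print(contract["corp_id"], "won a contract and made campaign contributions")
```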

A web address (or URL) is a globally unique identifier. A given web address refers to one document and nothing else, and this reliability promotes the dissemination of the document because it provides a means to refer to and direct people to it. A commitment to use a particular web address is the basis for permanent links.

Modern globally unique identifiers are URLs that not only identify but also provide enough information to locate details about the subject of the identifier on the Internet. For instance, http://www.law.cornell.edu/uscode/text/22/3301#b-6 is a globally unique identifier for a paragraph in the U.S. Code. An easy (and accepted) way to choose a globally unique identifier is to piggy-back off of your agency's web domain, which provides a “space” of identifiers that won't clash with anyone else's. For instance, you may coin http://www.youragency.gov/id/john_smith without any risk that someone in a different agency will use the same identifier to refer to something else.
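As a minimal sketch of this piggy-backing approach (the domain and naming scheme are the hypothetical ones from the example above, not a standard), an agency might mint identifiers like this:

```python
# Hypothetical sketch: minting globally unique identifiers under an agency's
# own web domain so they cannot clash with identifiers coined elsewhere.
from urllib.parse import quote

AGENCY_ID_SPACE = "http://www.youragency.gov/id/"  # the agency's own "space"

def mint_identifier(name):
    """Turn a human-readable name into a stable URI under the agency's domain."""
    slug = quote(name.strip().lower().replace(" ", "_"))
    return AGENCY_ID_SPACE + slug

print(mint_identifier("John Smith"))
# http://www.youragency.gov/id/john_smith
```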

See my book for more explanation on permanence and identifiers.

APIs

The combination of structured data and URIs is a read-only, web-based API. An API is defined by an agreement between a provider and a consumer about where and how to access a service. In the context of open data, an API refers to a web-based query system in which a consumer application connects to a government web server and asks it for information in a dataset.

While an API cannot exist without structured data and a URL, APIs often provide much more functionality than a simple read of a resource. They often provide live (or on-demand) services such as sorting and filtering lists, joining tables, and transforming outputs into multiple formats. APIs may also provide transactional services (such as voter registration).

Because APIs are live, it is considerably harder to implement a properly functioning API than it is to implement structured data or Global IDs. Structured data can be as simple as a file uploaded once. It is static. APIs are dynamic, are expected to have low response times, and are expected to have “high availability,” which means the service is expected to be running, and running fast, at all times. High availability also makes changing the structure of data more difficult because the API must serve “version 1” and “version 2” users simultaneously while the “version 1” API goes through a process of deprecation. All of this requires not only technical expertise from multiple sorts of technology professionals (developers, information architects, and systems administrators) and a large up-front investment in building the API, but also indefinite ongoing operational costs.
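To illustrate what a minimal read-only API of this kind might look like, and why versioning adds ongoing cost, here is a hypothetical sketch using Flask. The routes, dataset, and field names are invented for illustration; a real deployment would also need caching, monitoring, and a deprecation policy.

```python
# Hypothetical sketch of a read-only, versioned web API over a tiny in-memory
# dataset. Not production-ready: no caching, rate limiting, or monitoring,
# all of which a high-availability service would require.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

# A database would sit here in practice; a dict stands in for illustration.
SECTIONS = {
    "17-1201": {"citation": "17 USC 1201",
                "title": "Circumvention of copyright protection systems"},
}

@app.route("/v1/sections")
def list_sections_v1():
    # On-demand filtering is part of what distinguishes an API from bulk files.
    q = request.args.get("q", "").lower()
    return jsonify([s for s in SECTIONS.values() if q in s["title"].lower()])

@app.route("/v1/sections/<section_id>")
def get_section_v1(section_id):
    section = SECTIONS.get(section_id)
    if section is None:
        abort(404)
    return jsonify(section)

@app.route("/v2/sections/<section_id>")
def get_section_v2(section_id):
    # "Version 2" must run alongside v1 while v1 is deprecated, part of the
    # indefinite operational cost described above.
    section = SECTIONS.get(section_id)
    if section is None:
        abort(404)
    return jsonify({"data": section, "api_version": 2})

if __name__ == "__main__":
    app.run()  # e.g. GET /v1/sections/17-1201 or /v1/sections?q=circumvention
```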

Additionally, APIs alone typically do not meet the principles of open government data. APIs often require registration first (violating principle 6) and often do not provide data in bulk (violating principle 4). (When a dataset is so large as to be impractical to download in bulk, an API would be acceptable. By today's standards, that would be a data set at least 10 gigabytes in size.) Of course an API may still be very useful, but bulk data should be available first. (The Australian Governments Open Access and Licensing Framework makes an API a core part of data access, placing its recommendation under the heading “open query” and before “open bulk supply,” which I believe is a poor strategy.)

Linked Open Data

The final row of the maturity model is “Linked Open Data” (LOD, linkeddata.org), which is little more than a thorough application of structure, Global IDs, and APIs. Beyond this, linked data uses a particular file format called RDF and a particular API protocol called SPARQL. Linked data provides a high degree of interconnectedness across data silos, both in the objects mentioned in the data (e.g. government contractors) and in the concepts that relate the objects together (so-called predicates).
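To give a flavor of what this looks like in practice, here is a hypothetical sketch using the rdflib Python library. The namespaces and predicates are invented for illustration (a real deployment would re-use existing vocabularies where possible); the point is that the spending record and the ownership record refer to the same company URI, so the two silos connect, and a SPARQL query can traverse both.

```python
# Hypothetical sketch of linked data: a spending record and a corporate
# ownership record share a company URI, so a query can cross both silos.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.gov/def/")          # invented vocabulary (predicates)
SPEND = Namespace("http://example.gov/spending/")  # invented spending records
CORP = Namespace("http://example.gov/corp/")       # invented corporate registry

g = Graph()

# Spending silo: contract 123 was awarded to a company.
g.add((SPEND["contract/123"], RDF.type, EX.Contract))
g.add((SPEND["contract/123"], EX.awardedTo, CORP["US-CORP-0042"]))
g.add((SPEND["contract/123"], EX.amountUSD, Literal(1500000)))

# Ownership silo: the same company URI, described elsewhere.
g.add((CORP["US-CORP-0042"], EX.ownedBy, CORP["US-CORP-0007"]))

# A SPARQL query that spans both silos in one step.
results = g.query("""
    PREFIX ex: <http://example.gov/def/>
    SELECT ?contract ?owner WHERE {
        ?contract ex:awardedTo ?company .
        ?company ex:ownedBy ?owner .
    }
""")
for row in results:
    print(f"Contract {row.contract} was awarded to a company owned by {row.owner}")

print(g.serialize(format="turtle"))  # the graph itself, as RDF/Turtle
```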

Promoted by the creator of the World Wide Web, Tim Berners-Lee, the Linked Open Data method for publishing databases achieves data openness in a standard format and the potential for interconnectivity with other databases without the expense of wide agreement on unified inter-agency or global data standards. Linked Data is a practical implementation of Semantic Web ideas, and several tools exist to expose legacy databases and spreadsheets in the LOD method. Though I have been writing about the uses of the Semantic Web for government data for as long as I've been publishing legislative data, it has not caught on in the United States, though it has become a core part of Data.gov.uk and is a recommendation of the Australian Governments Open Access and Licensing Framework.

The W3C working draft Publishing Open Government Data and the Linked Data Cookbook published by the W3C Government Linked Data Working Group provide additional best practices with regard to globally unique identifiers and Linked Open Data.

As with structure, linked data requires careful work and an investment up-front, but it provides a basis, a unified framework, for answering complex questions that span data sources and even entire domains. This creates a level of adaptability far beyond what is possible in previous rows. But linked data is still an experimental technology.

As a map of the open government data field, this chart expands on previous work by others in mapping the data and people in the open government data community. Yu and Robinson (2012) (“The New Ambiguity of ‘Open Government’ ”) proposed a map with a horizontal axis from service delivery to public accountability and a vertical axis from inert data (think PDFs or audio records) to adaptable data (meaning structured data). Their horizontal axis appears in the columns above disguised as “Services” and, for accountability, “Operations”. But I have added to that axis new columns on either side. And I have, in a sense, divided their vertical axis into discrete technologies that range from inert (freedom of information) to adaptable (linked data).

The rows of the maturity model are also similar to Tim Berners-Lee’s 5 ★ Open Data [original proposal], a grading scheme meant to encourage governments to move toward linked open data. While the 5 stars are a fine roadmap for linked open data specifically, that model does not address important questions in the wider open government data movement (such as when to deploy APIs).