Open Government Data: The Book

By Joshua Tauberer. Second Edition: 2014.
Also available as a Paperback and for Kindle. Tweet me at @JoshData.

Online and Free, Primary, Timely, Accessible (Principles 1–4)

(1) Information is not meaningfully public if it is not available on the Internet for free.

Today, the first place many people turn for information is the Web, and they expect to find government information there. If information can be obtained only by request in person, the information is essentially unavailable to the vast majority of citizens. Likewise, any fee for access greatly limits the availability of the information.

This principle was adapted from Sunlight Foundation’s Principles for Transparency in Government (February 2009)1 and the “access” requirement of the Open Knowledge Foundation’s Open Definition (reproduced in Open Knowledge Definition). Sunlight Foundation’s Open Data Policy Guidelines (2014) say to “proactively release government information online,” and the G8 Open Data Charter2 says that data should be free of charge.

It is rapidly becoming suspect for government records, especially those that are relevant to government transparency, to be made available to the public only in person.

The federal Office of the Federal Register has been quite frank about how things were just a few years ago:

The physical version of the [Public Inspection] Desk inside the OFR [Office of the Federal Register] office near Capitol Hill was a battered old table with documents piled into wooden boxes. For 73 years we offered shoe-leather access — if you worked inside the DC Beltway, and let’s say you wanted to know how the Government was reacting to a financial crisis, you could hoof it over to the OFR and look for documents in the emergency filing box.

You might stand in line to read an item, then wait for a photocopier and hope it didn’t break down. You could try agency websites, but you could not rely on that material. It might not be current, and only the OFR had the original, which may have been modified by the agency at the OFR after our legal review ensured that effective dates made sense and CFR [Code of Federal Regulations] amendments were properly stated.3

They have since brought those documents online as part of Federal Register 2.0. But those conditions still exist elsewhere in the federal government. The House of Representatives’ Legislative Resource Center, located in the basement of one of the House office buildings, is another one of those locations. And while many of its records began to go online in 2009 under pressure from the public, some still remain in print only, including samples of franked mail, which congressmen are required to submit for review.

What constitutes an appropriate fee for reuse of government information varies from culture to culture, and this principle may certainly be biased toward U.S. culture. In the United States expectations are particularly high for government transparency. Fees beyond the marginal cost of reproducing a document are viewed with suspicion, as if the fee is designed to impinge on the public’s ability to oversee its government. The European Union Public Sector Information Directive (EU PSI Directive), as updated in 2013, requires most EU government bodies to charge no more for data than the marginal costs of “reproduction, provision, and dissemination”.4 Fortunately there is essentially no marginal cost of online distribution for most government records, so even where marginal-cost fees are allowed, “public” means “online” and “free”.

An open government working group convened by Carl Malamud in November 2007 was the first to attempt to define open government data. Its 8 Principles of Open Government Data5, included in full in 8 Principles, specified a working definition for what it means for public government data to be open. The second through eighth principles below are taken directly from the 2007 working group’s definition.6 Data should be:

(2) “Primary: Primary data is data as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.”

This principle relates to the change in emphasis from providing government information to information consumers to providing information to mediators, including journalists, who will build applications and synthesize ideas that are radically different from what is found in the source material. While information consumers typically require some analysis and simplification, information mediators can achieve more innovative solutions with the most raw form of government data.

One often finds that the only open access to audio, video, and images is at low resolutions for the purpose of making them suitable for viewing on a website. While this is an important use case, publishers of open data have an obligation to make the full-resolution information available in bulk to support additional applications such as the creation of professional media and archiving. That may be in addition to making available a low-resolution format. For instance, Congressional committees generally offer live low-resolution streaming video of committee events, and some committees additionally offer separate access to high-resolution archival footage.7 In the case of government reports, separate documents should make available any underlying data used in the analysis. For instance, the U.S. Census Bureau publishes reports (i.e. PDFs) about the nation’s demographics as well as comprehensive, raw, tabular data downloads that researchers can analyze for their own reports on other subjects.

In other words, when applying these principles to media such as documents and audio/visual recordings, one must consider the dual role of media: On the one hand, as part of the agency’s website they are a component of the agency’s communications strategy. Because of this, web media must be available in formats suitable for display in a web browser and should be easily locatable through search. But — on the other hand — web media is often also a government record. Reports and video archives, for instance, are of interest not just to visitors to the agency’s website but also to journalists and technologists who may want to analyze them in ways not supported by a method of publishing intended for a web audience. For instance, web video is often played at a lower resolution than what was originally recorded. And documents may be displayed as a digitally signed PDF for ease of authentic reading and printing, but other source formats may be more useful for research. Web media must often be made available in multiple formats suitable for these different purposes.

Granularity and the use of multiple formats are also mentioned in the G8 Open Data Charter8.

(3) “Timely: Data are made available as quickly as necessary to preserve the value of the data.” Data is not open if it is only shared after it is too late for it to be useful to the public.

What is a reasonable level of timeliness depends on the nature of the data set. Data relevant to an ongoing policy debate requires higher standards of timeliness. Timeliness is not just that data is available once, but that data users can find updates quickly. RSS feeds can help notify users of new content, and data should explicitly include a list of recent changes to the format and content.
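As a rough illustration of how a feed lets data users find updates quickly, the sketch below reads an RSS 2.0 feed and keeps only the items published since the user last checked. The feed contents, the example.gov URLs, and the function name items_since are all invented for this example; a real client would fetch the feed over HTTP rather than use an inline string.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical feed contents (an assumption for illustration only).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example agency data updates</title>
    <item>
      <title>Dataset refreshed</title>
      <link>https://data.example.gov/files/2014-03.csv</link>
      <pubDate>Mon, 03 Mar 2014 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Column renamed in format</title>
      <link>https://data.example.gov/changes/2014-01</link>
      <pubDate>Wed, 15 Jan 2014 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def items_since(feed_xml, last_checked):
    """Return (title, link, published) for feed items newer than last_checked."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published > last_checked:
            fresh.append((item.findtext("title"), item.findtext("link"), published))
    return fresh


if __name__ == "__main__":
    last_visit = datetime(2014, 2, 1, tzinfo=timezone.utc)
    for title, link, published in items_since(SAMPLE_FEED, last_visit):
        print(published.date(), title, link)
```

Note that the second sample item announces a format change rather than new data, which is one way a publisher might use the same feed to carry the list of recent changes.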

The American Association of Law Libraries’ Principles & Core Values Concerning Public Information on Government Websites9 notes that information can be current but users of the information may be unable to tell. It is as important for users to know the data is current as for the data itself to be current. Their principles state, “Government websites must provide users with sufficient information to make assessments about the accuracy and currency of legal information published on the website.”

When timeliness is in competition with quality, strike a balance or work iteratively. The UK Open Data Whitepaper (2012) states:

(8) Public data will be timely and fine-grained. The Government’s approach to Open Data is not limited to the publication of aggregate data long after the events to which it relates. . . . (9) Release data quickly, and then work to make sure that it is available in open standard formats, including linked data forms.10

Timeliness is also mentioned in the G8 Open Data Charter11.

(4) “Accessible: Data are available to the widest range of users for the widest range of purposes.”

The accessibility principle covers a wide range of concerns including the need for the user of the data to be able to locate, interpret, and understand it and through software to be able to acquire and decode it. The G8 Open Data Charter12 reused this accessibility language verbatim.

The choice of data format has wide implications for what applications can be built on top of the data, what usage restrictions may result from data format patents, and whether archived data is likely to be usable in the future when we may not have access to the same software we do today. Data must be made available in formats that support both intended and unintended uses of the data by being published with current industry standard protocols and formats, preferably open, non-proprietary protocols and formats. This principle is also related to the Open Definition’s “access” and “absence of technological restriction” requirements.

If the data is accessible through an interactive interface, it must also be possible to download the complete data set in raw form and in bulk through an automated process (i.e. a bulk data download). If the data set is distributed across multiple locations, for instance if the requirement of publication is spread across multiple agencies or offices, then the automated part becomes much more important, since parts of the complete data are far less useful in isolation. The ability to locate parts of a data set is called discoverability and is strengthened by techniques such as well-known locations, sitemaps, and use of common formats and standards. Common methods of making data available are simple links to downloadable files, the use of an anonymous FTP (File Transfer Protocol) server, and for data with frequent updates an rsync or version control server (e.g. Subversion or Git).
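To make the discoverability point concrete, here is a minimal sketch of how an automated process might use a sitemap to locate the parts of a data set before downloading them in bulk. The sitemap contents, the example.gov URLs, the list of file suffixes, and the function name bulk_file_urls are assumptions made for this example; only the XML namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap (an assumption for illustration only).
SAMPLE_SITEMAP = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://data.example.gov/bulk/part-1.csv</loc>
    <lastmod>2014-03-01</lastmod>
  </url>
  <url>
    <loc>https://data.example.gov/bulk/part-2.csv</loc>
    <lastmod>2014-03-02</lastmod>
  </url>
  <url>
    <loc>https://data.example.gov/about.html</loc>
  </url>
</urlset>"""


def bulk_file_urls(sitemap_xml, suffixes=(".csv", ".json", ".xml", ".zip")):
    """List the sitemap entries that look like downloadable data files."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        # Keep only entries whose extension suggests a data file.
        if url.lower().endswith(suffixes):
            urls.append(url)
    return urls


if __name__ == "__main__":
    for url in bulk_file_urls(SAMPLE_SITEMAP):
        print(url)
```

A script like this only finds the files; fetching them would be a separate step, and for frequently updated data a tool such as rsync or Git is likely a better fit than repeated full downloads.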

Data must be provided with sufficient documentation so that the data user understands the structure of and abbreviations in the data. Documentation may assume some level of subject domain expertise but should not assume knowledge of internal agency practices. The G8 Open Data Charter13 summarizes the goals of documentation:

make sure that data are fully described, so that consumers have sufficient information to understand their strengths, weaknesses, analytical limitations, and security requirements, as well as how to process the data

On the House Statement of Disbursements website, discussed in greater detail in House Disbursements, there is a thorough explanation of how expenditures are tracked by the House and a separate glossary of 42 terms that are either obscure or are used in unusual ways by the House accounting system.14
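One way a publisher might check that documentation keeps pace with the data is to compare the column names in a data file against a data dictionary and flag anything undocumented. The sketch below assumes a CSV file with a header row; the column names, the dictionary entries, and the function name undocumented_columns are invented for illustration and are not the House's actual format.

```python
import csv
import io

# Hypothetical data dictionary: column name -> plain-language description.
DATA_DICTIONARY = {
    "office": "Name of the office that made the expenditure",
    "amount": "Dollar amount of the expenditure",
}

# Hypothetical data file with one column missing from the dictionary.
SAMPLE_CSV = "office,amount,trans_cd\nExample Office,120.50,AP\n"


def undocumented_columns(csv_text, dictionary):
    """Return the header names that have no entry in the data dictionary."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [name for name in header if name not in dictionary]


if __name__ == "__main__":
    missing = undocumented_columns(SAMPLE_CSV, DATA_DICTIONARY)
    print("Columns needing documentation:", missing)
```

A check like this covers only structure. Explaining obscure terms and internal practices, as the House glossary does, still requires a person who knows the subject.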

  1. Link no longer available.

  2. G8 Open Data Charter and Technical Annex, 2013.


  4. The original 2003 EU PSI directive allowed government bodies to increase fees to obtain a “reasonable return on investment.” The 2013 update constrains fees for most government bodies and also added a requirement that fees be transparent. For more on the changes, see Ton Zijlstra and Katleen Janssen. April 19, 2013. The new PSI Directive: as good as it seems?

  5. which I helped write

  6. I omit the group’s first principle, completeness, as it seems redundant.

  7. Nick Judd. January 6, 2012. Watching Them Watching: Issa Touts Video Archive of Oversight Hearings. TechPresident.

  8. G8 Open Data Charter and Technical Annex, 2013.

  9. March 24, 2007.


  11. G8 Open Data Charter and Technical Annex, 2013.

  12. G8 Open Data Charter and Technical Annex, 2013.

  13. G8 Open Data Charter and Technical Annex, 2013.