1. Principles 1–4: The Basic Principles

1. Information is not meaningfully public if it is not available on the Internet for free.

Today, the first place many people turn for information is the Web and they expect to find government information there. If information can be obtained only by request in person, the information is essentially unavailable to the vast majority of citizens. Likewise, any fee for access greatly limits the availability of the information.

It is rapidly becoming suspect for government records, especially those that are relevant to government transparency, to be made available to the public only in person.

The federal Office of the Federal Register has been quite frank about how things were just a few years ago:

The physical version of the [Public Inspection] Desk inside the OFR [Office of the Federal Register] office near Capitol Hill was a battered old table with documents piled into wooden boxes. For 73 years we offered shoe-leather access — if you worked inside the DC Beltway, and let’s say you wanted to know how the Government was reacting to a financial crisis, you could hoof it over to the OFR and look for documents in the emergency filing box.

You might stand in line to read an item, then wait for a photocopier and hope it didn’t break down. You could try agency websites, but you could not rely on that material. It might not be current, and only the OFR had the original, which may have been modified by the agency at the OFR after our legal review ensured that effective dates made sense and CFR [Code of Federal Regulations] amendments were properly stated.[133]

They have since brought those documents online as part of Federal Register 2.0. But those conditions still exist elsewhere in the federal government. The House of Representatives’ Legislative Resource Center, located in the basement of one of the House office buildings, is another one of those locations. And while many of its records began to go online in 2009 under pressure from the public, some still remain in print only, including samples of franked mail, which congressmen are required to submit for review.

What constitutes an appropriate fee for reuse of government information varies from culture to culture, and this principle may certainly be biased toward U.S. culture. In the United States expectations are particularly high for government transparency. Fees beyond the marginal cost of reproducing a document are viewed with suspicion, as if the fee is designed to impinge on the public’s ability to oversee its government. Fortunately there is essentially no marginal cost of online distribution for most government records, so even when fees at marginal cost are allowed, “public” means “online” and “free”.

The European Union Public Sector Information Directive (EU PSI Directive) sets a much lower standard: “(14) Where charges are made, the total income should not exceed the total costs of collecting, producing, reproducing and disseminating documents, together with a reasonable return on investment.” Although the directive does go on to recommend the marginal cost, the rule it actually sets allows EU governments to use public data for profit. (This perhaps is changing as the value of open data becomes better recognized. In a December 2011 speech by the vice president of the European Commission and commissioner for its Digital Agenda, the benefit of marginal cost was reaffirmed.[134])

This first principle was adapted from Sunlight Foundation’s Principles for Transparency in Government[135] and the “access” requirement of the Open Knowledge Foundation’s Open Knowledge Definition (OKD) at opendefinition.org (and reproduced in section 7.2). The recommendations below continue with those published by the Open Government Working Group (opengovdata.org), convened by Carl Malamud in November 2007. Here are its principles two through four.[136] Data should be:

2. “Primary: Primary data is data as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.”

This principle relates to the change in emphasis from providing government information to information consumers to providing information to mediators, including journalists, who will build applications and synthesize ideas that are radically different from what is found in the source material. While information consumers typically require some analysis and simplification, information mediators can achieve more innovative solutions with the most raw form of government data.

One often finds that the only open access to audio, video, and images is at low resolutions, chosen to make them suitable for viewing on a website. While this is an important use case, publishers of open data have an obligation to make the full-resolution information available in bulk to support additional applications such as the creation of professional media and archiving. That may be in addition to making available a low-resolution format. For instance, Congressional committees generally offer live low-resolution streaming video of committee events, and some committees additionally offer separate access to high-resolution archival footage.[137] In the case of government reports, separate documents should make available any underlying data used in the analysis. For instance, the U.S. Census Bureau publishes reports (e.g. PDFs) about the nation’s demographics as well as comprehensive, raw, tabular data downloads that researchers can analyze for their own reports on other subjects.

In other words, when applying these principles to media such as documents and audio/visual recordings, one must consider the dual role of media: On the one hand, as part of the agency’s website they are a component of the agency’s communications strategy. Because of this, web media must be available in formats suitable for display in a web browser and should be easily locatable through search. But web media is often also a government record. Reports and video archives, for instance, are of interest not just to visitors to the agency’s website but also to journalists and technologists who may want to analyze them in ways not supported by a method of publishing intended for a web audience. For instance, web video is often played at a lower resolution than what was originally recorded. And documents may be displayed as a digitally signed PDF for ease of authentic reading and printing, but other source formats may be more useful for research. Web media must often be made available in multiple formats suitable for these different purposes.

3. “Timely: Data are made available as quickly as necessary to preserve the value of the data.” Data is not open if it is only shared after it is too late for it to be useful to the public.

What is a reasonable level of timeliness depends on the nature of the data set. Data relevant to an ongoing policy debate requires higher standards of timeliness. Timeliness is not just that data is available once, but that data users can find updates quickly. RSS feeds can help notify users of new content, and data should explicitly include a list of recent changes to the format and content.
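As a rough illustration of how an RSS feed lets data users find updates quickly, the following Python sketch lists the feed items published since the user last checked. The feed content, titles, and data.example.gov URLs are hypothetical, and the function name items_since is an assumption made for this example; a real consumer would fetch the agency's feed over HTTP rather than use an embedded string.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed announcing data updates (not a real agency feed).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Agency Data Updates</title>
    <item>
      <title>Spending data, second quarter</title>
      <link>https://data.example.gov/spending-q2.csv</link>
      <pubDate>Mon, 02 Jul 2012 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Spending data, first quarter</title>
      <link>https://data.example.gov/spending-q1.csv</link>
      <pubDate>Mon, 02 Apr 2012 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def items_since(feed_xml, last_checked):
    """Return (title, link) pairs for feed items published after last_checked."""
    root = ET.fromstring(feed_xml)
    new_items = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if pub_date is None:
            continue  # without a date we cannot tell whether the item is new
        # RSS dates use the RFC 822 format, which the email module can parse.
        published = parsedate_to_datetime(pub_date)
        if published > last_checked:
            new_items.append((item.findtext("title"), item.findtext("link")))
    return new_items


if __name__ == "__main__":
    last_visit = datetime(2012, 6, 1, tzinfo=timezone.utc)
    for title, link in items_since(SAMPLE_FEED, last_visit):
        print(title, link)
```

Items without a publication date are skipped, which mirrors the point above: an update that does not say when it was made is of little help to someone trying to stay current.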

The American Association of Law Libraries’ Principles & Core Values Concerning Public Information on Government Websites[138] notes that information can be current but users of the information may be unable to tell. It is as important for users to know the data is current as for the data itself to be current. Their principles state, “Government websites must provide users with sufficient information to make assessments about the accuracy and currency of legal information published on the website.”

4. “Accessible: Data are available to the widest range of users for the widest range of purposes.” The accessibility principle covers a wide range of concerns, including the need for users of the data to be able to locate, interpret, and understand it, and to be able to acquire and decode it with software.

The choice of data format has wide implications for what applications can be built on top of the data, what usage restrictions may result from data format patents, and whether archived data is likely to be usable in the future when we may not have access to the same software we do today. Data must be made available in formats that support both intended and unintended uses of the data by being published with current industry standard protocols and formats, preferably open, non-proprietary protocols and formats. Open formats tend to have lower barriers to use and also ensure that we have the knowledge to be able to decode the data when the current software for that format is no longer available. This principle is also related to the OKD’s “access” and “absence of technological restriction” requirements.

If the data is accessible through an interactive interface, it must also be possible to download the complete data set in raw form and in bulk through an automated process (i.e. a bulk data download). If the data set is distributed across multiple locations, for instance if the requirement of publication is spread across multiple agencies or offices, then the automated part becomes much more important, since parts of the complete data are far less useful in isolation. The ability to locate parts of a data set is called discoverability and is strengthened by techniques such as well-known locations, sitemaps, and use of common formats and standards. Common methods of making data available are simple links to downloadable files, the use of an anonymous FTP (File Transfer Protocol) server, and for data with frequent updates an rsync or version control server (e.g. Subversion or Git).
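To make the idea of discoverability concrete, here is a minimal Python sketch that reads a sitemap and lists the entries that look like bulk data files, so that an automated process could then download them. The sitemap content and data.example.gov URLs are hypothetical, the function name bulk_files and the list of file extensions are assumptions for illustration, and only the XML namespace is taken from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap published at a well-known location on an agency site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://data.example.gov/bulk/disbursements-2011.csv</loc>
    <lastmod>2012-01-15</lastmod>
  </url>
  <url>
    <loc>https://data.example.gov/bulk/disbursements-2012.csv</loc>
    <lastmod>2012-07-15</lastmod>
  </url>
  <url>
    <loc>https://data.example.gov/about.html</loc>
  </url>
</urlset>"""


def bulk_files(sitemap_xml, extensions=(".csv", ".xml", ".json", ".zip")):
    """Return (url, lastmod) pairs for sitemap entries that look like data files."""
    root = ET.fromstring(sitemap_xml)
    files = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        # Assumption: data files can be recognized by their file extension.
        if loc.lower().endswith(extensions):
            files.append((loc, lastmod))
    return files


if __name__ == "__main__":
    for loc, lastmod in bulk_files(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

The lastmod value, where present, lets a mirroring script skip files that have not changed since its last run, which is the same job that rsync or a version control server does more thoroughly.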

Data must be provided with sufficient documentation so that the data user understands the structure of and abbreviations in the data. Documentation may assume some level of subject domain expertise but should not assume knowledge of internal agency practices. On the House Statement of Disbursements website, there is a thorough explanation of how expenditures are tracked by the House and a separate glossary of 42 terms that are either obscure or are used in unusual ways by the House accounting system.[139]

An “API” provides access to slices of data and often requires registration first (see principle 6 below), so an API typically does not meet the requirements of access. The exception is when a data set is so large as to be not practically downloadable in bulk. By today’s standards, that would be a data set at least 10 gigabytes in size, or about 6 hours on a broadband connection. Of course an API may still be very useful, but bulk data should be available first. (The Australian Governments Open Access and Licensing Framework makes an API a core part of data access, placing its recommendation under the heading “open query.” This is a much higher standard of technological openness than most of the other principles in this section, and in fact the framework goes so far as to recommend the use of a SPARQL endpoint; SPARQL is the query language for RDF, Linked Open Data, and the Semantic Web.[140] However, SPARQL is a largely untested approach. A good example of an API to use as a model is Sunlight Foundation’s Real Time Congress API.)
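The “10 gigabytes in about 6 hours” figure can be checked with simple arithmetic. The Python sketch below assumes decimal gigabytes (10^9 bytes) and a steady connection with no protocol overhead; the function names are illustrative only.

```python
def implied_speed_mbps(size_gigabytes, hours):
    """Connection speed (megabits per second) implied by a download time."""
    bits = size_gigabytes * 1e9 * 8   # assumption: 1 gigabyte = 10^9 bytes
    seconds = hours * 3600
    return bits / seconds / 1e6


def download_hours(size_gigabytes, speed_mbps):
    """Hours needed to download a data set at a steady connection speed."""
    bits = size_gigabytes * 1e9 * 8
    seconds = bits / (speed_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # 10 GB in 6 hours corresponds to a connection of roughly 3.7 Mbit/s.
    print(round(implied_speed_mbps(10, 6), 1))
    # The same data set on a 100 Mbit/s connection takes well under an hour.
    print(round(download_hours(10, 100), 2))
```

Because the threshold depends on the connection speed, the size at which bulk download stops being practical will rise as typical connections get faster.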
